2009-10-21 19:20:16

by Valerie Aurora

Subject: [RFC PATCH 00/40] Writable overlays (union mounts)

Here is the current patch set for writable overlays (union mounts).
It needs lots of review! Especially the bits where we do nasty things
with readdir().

Writable overlays let you mount one read-write file system
transparently over another read-only file system. This is useful for
things like LiveCDs. Detailed documentation and HOWTO here:

http://valerieaurora.org/union/
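
As a rough illustration of the layering described above, here is a toy user-space model (not kernel code) of how a name might resolve across a writable top layer and a read-only bottom layer. The names layer, entry and union_lookup are invented for this sketch, and the whiteout behaviour is an assumption based on the patch titles below ("stop lookup when finding a whiteout").

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry kinds for this sketch only. */
enum entry_kind { ENT_FILE, ENT_WHITEOUT };

struct entry {
	const char *name;
	enum entry_kind kind;
	const char *data;	/* file contents, NULL for a whiteout */
};

struct layer {
	const struct entry *entries;
	size_t count;
};

/* Linear search of one layer; returns NULL if the name is absent. */
static const struct entry *layer_find(const struct layer *l, const char *name)
{
	for (size_t i = 0; i < l->count; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return &l->entries[i];
	return NULL;
}

/*
 * Resolve a name across both layers. An entry in the top layer always
 * wins; a whiteout in the top layer hides the name entirely; otherwise
 * the lookup falls through to the read-only bottom layer.
 * Returns the file contents, or NULL if the name does not exist.
 */
static const char *union_lookup(const struct layer *top,
				const struct layer *bottom, const char *name)
{
	const struct entry *e = layer_find(top, name);

	if (e)
		return e->kind == ENT_WHITEOUT ? NULL : e->data;
	e = layer_find(bottom, name);
	return (e && e->kind == ENT_FILE) ? e->data : NULL;
}
```

In this model, deleting a file that exists only in the bottom layer amounts to adding a whiteout entry to the top layer; the bottom layer is never modified.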

The git version is in branch "overlay" of:

http://git.kernel.org/?p=linux/kernel/git/val/linux-2.6.git;a=summary

Call for maintainer: Neither Jan nor I can give this patch set the
long term love that it needs. If you are interested in becoming the
maintainer for union mounts at some point in the future, please drop
me an email.

-VAL

Felix Fietkau (2):
whiteout: jffs2 whiteout support
fallthru: jffs2 fallthru support

Jan Blunck (25):
VFS: BUG() if somebody tries to rehash an already hashed dentry
VFS: propagate mnt_flags into do_loopback
VFS: Make lookup_hash() return a struct path
VFS: Remove unnecessary micro-optimization in cached_lookup()
VFS: Make real_lookup() return a struct path
VFS: Introduce dput() variant that maintains a kill-list
Don't replace nameidata path when following links
whiteout: Don't return information about whiteouts to userspace
whiteout: Add vfs_whiteout() and whiteout inode operation
whiteout: Set S_OPAQUE inode flag when creating directories
union-mount: Allow removal of a directory
whiteout: tmpfs whiteout support
whiteout: Split off ext2_append_link() from ext2_add_link()
whiteout: ext2 whiteout support
whiteout: Add path_whiteout() helper
union-mount: Introduce MNT_UNION and MS_UNION flags
union-mount: Introduce union_mount structure
union-mount: Drive the union cache via dcache
union-mount: Some checks during namespace changes
union-mount: Changes to the namespace handling
union-mount: Make lookup work for union-mounted file systems
union-mount: stop lookup when directory has S_OPAQUE flag set
union-mount: stop lookup when finding a whiteout
union-mount: call do_whiteout() on unlink and rmdir
union-mount: Add support for rename by __union_copyup()

Valerie Aurora (14):
VFS: Add read-only users count to superblock
union-mount: Documentation
union-mount: in-kernel file copy between union mounted filesystems
union-mount: Always create topmost directory on open
fallthru: Basic fallthru definitions
fallthru: Support for fallthru entries in union mount lookup
fallthru: ext2 fallthru support
fallthru: tmpfs fallthru support
union-mount: Copy up directory entries on first readdir()
union-mount: Increment read-only users count for read-only layer
union-mount: Check read-only/read-write status of layers
union-mount: Make pivot_root work with union mounts
union-mount: Ignore read-only file system in permission checks
union-mount: Make truncate work in all its glorious UNIX variations

Documentation/filesystems/union-mounts.txt | 708 ++++++++++++++
fs/Kconfig | 13 +
fs/Makefile | 1 +
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 +-
fs/autofs4/root.c | 6 +
fs/compat.c | 9 +
fs/dcache.c | 143 +++-
fs/ext2/dir.c | 248 +++++-
fs/ext2/ext2.h | 4 +
fs/ext2/inode.c | 11 +-
fs/ext2/namei.c | 85 ++-
fs/ext2/super.c | 7 +
fs/jffs2/dir.c | 108 ++-
fs/jffs2/fs.c | 4 +
fs/jffs2/super.c | 2 +-
fs/libfs.c | 21 +-
fs/namei.c | 1424 +++++++++++++++++++++++++---
fs/namespace.c | 120 +++-
fs/nfsctl.c | 6 +-
fs/nfsd/nfs3xdr.c | 5 +
fs/nfsd/nfs4xdr.c | 2 +-
fs/nfsd/nfsxdr.c | 4 +
fs/open.c | 130 +--
fs/readdir.c | 26 +
fs/super.c | 14 +
fs/union.c | 978 +++++++++++++++++++
include/linux/dcache.h | 29 +
include/linux/ext2_fs.h | 5 +
include/linux/fs.h | 15 +-
include/linux/jffs2.h | 8 +
include/linux/mount.h | 4 +
include/linux/namei.h | 6 +
include/linux/union.h | 84 ++
mm/shmem.c | 195 ++++-
35 files changed, 4172 insertions(+), 265 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
create mode 100644 fs/union.c
create mode 100644 include/linux/union.h


2009-10-21 19:20:12

by Valerie Aurora

Subject: [PATCH 01/41] VFS: BUG() if somebody tries to rehash an already hashed dentry

From: Jan Blunck <[email protected]>

Break early when somebody tries to rehash an already hashed dentry.
Otherwise this leads to interesting corruptions in the dcache hash table
later on.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9e5cd3c..38bf982 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1550,6 +1550,7 @@ void d_rehash(struct dentry * entry)
{
spin_lock(&dcache_lock);
spin_lock(&entry->d_lock);
+ BUG_ON(!d_unhashed(entry));
_d_rehash(entry);
spin_unlock(&entry->d_lock);
spin_unlock(&dcache_lock);
--
1.6.3.3
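
To see why rehashing an already hashed entry is dangerous, here is a minimal user-space sketch of an hlist-style hash chain. It is not the dcache code: hnode, hhead and the helper names are made up for illustration, and the checked variant returns an error where the patch would hit BUG_ON().

```c
#include <stddef.h>

/* Toy hash chain node; pprev == NULL means "not on any chain". */
struct hnode {
	struct hnode *next, **pprev;
};

struct hhead {
	struct hnode *first;
};

static int hnode_unhashed(const struct hnode *n)
{
	return n->pprev == NULL;
}

/* Insert at the head with no sanity check at all. */
static void hash_add_unchecked(struct hhead *h, struct hnode *n)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Same insert, but refuse a node that is already on a chain. */
static int hash_add_checked(struct hhead *h, struct hnode *n)
{
	if (!hnode_unhashed(n))
		return -1;	/* the patch uses BUG_ON() at this point */
	hash_add_unchecked(h, n);
	return 0;
}

/* Walk the chain, giving up after 'limit' steps so a cycle is visible. */
static size_t chain_length(const struct hhead *h, size_t limit)
{
	size_t len = 0;

	for (const struct hnode *n = h->first; n && len < limit; n = n->next)
		len++;
	return len;
}
```

Adding the same node twice through the unchecked path makes the node point at itself, so any walk of the chain never terminates. The checked path leaves the chain untouched.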

2009-10-21 19:20:16

by Valerie Aurora

Subject: [PATCH 02/41] VFS: propagate mnt_flags into do_loopback

From: Jan Blunck <[email protected]>

Pass the mnt_flags into do_loopback() so that they can be checked when
loopback (bind) mounting something into a union.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 7230787..4cd43ea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1446,8 +1446,8 @@ static int do_change_type(struct path *path, int flag)
/*
* do loopback mount.
*/
-static int do_loopback(struct path *path, char *old_name,
- int recurse)
+static int do_loopback(struct path *path, char *old_name, int recurse,
+ int mnt_flags)
{
struct path old_path;
struct vfsmount *mnt = NULL;
@@ -1944,7 +1944,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
data_page);
else if (flags & MS_BIND)
- retval = do_loopback(&path, dev_name, flags & MS_REC);
+ retval = do_loopback(&path, dev_name, flags & MS_REC,
+ mnt_flags);
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&path, flags);
else if (flags & MS_MOVE)
--
1.6.3.3

2009-10-21 19:30:34

by Valerie Aurora

Subject: [PATCH 03/41] VFS: Make lookup_hash() return a struct path

From: Jan Blunck <[email protected]>

This patch changes lookup_hash() to return an error code and hand the
result back in a struct path supplied by the caller, instead of
returning a bare dentry.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 114 +++++++++++++++++++++++++++++++----------------------------
1 files changed, 60 insertions(+), 54 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1f13751..e334f25 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1172,7 +1172,7 @@ static int path_lookup_open(int dfd, const char *name,
}

static struct dentry *__lookup_hash(struct qstr *name,
- struct dentry *base, struct nameidata *nd)
+ struct dentry *base, struct nameidata *nd)
{
struct dentry *dentry;
struct inode *inode;
@@ -1219,14 +1219,22 @@ out:
* needs parent already locked. Doesn't follow mounts.
* SMP-safe.
*/
-static struct dentry *lookup_hash(struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+ struct path *path)
{
int err;

err = inode_permission(nd->path.dentry->d_inode, MAY_EXEC);
if (err)
- return ERR_PTR(err);
- return __lookup_hash(&nd->last, nd->path.dentry, nd);
+ return err;
+ path->mnt = nd->path.mnt;
+ path->dentry = __lookup_hash(name, nd->path.dentry, nd);
+ if (IS_ERR(path->dentry)) {
+ err = PTR_ERR(path->dentry);
+ path->dentry = NULL;
+ path->mnt = NULL;
+ }
+ return err;
}

static int __lookup_one_len(const char *name, struct qstr *this,
@@ -1736,12 +1744,10 @@ struct file *do_filp_open(int dfd, const char *pathname,
if (flag & O_EXCL)
nd.flags |= LOOKUP_EXCL;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(&nd);
- path.mnt = nd.path.mnt;
+ error = lookup_hash(&nd, &nd.last, &path);

do_last:
- error = PTR_ERR(path.dentry);
- if (IS_ERR(path.dentry)) {
+ if (error) {
mutex_unlock(&dir->d_inode->i_mutex);
goto exit;
}
@@ -1902,8 +1908,7 @@ do_link:
}
dir = nd.path.dentry;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(&nd);
- path.mnt = nd.path.mnt;
+ error = lookup_hash(&nd, &nd.last, &path);
__putname(nd.last.name);
goto do_last;
}
@@ -1937,7 +1942,8 @@ EXPORT_SYMBOL(filp_open);
*/
struct dentry *lookup_create(struct nameidata *nd, int is_dir)
{
- struct dentry *dentry = ERR_PTR(-EEXIST);
+ struct path path = { .dentry = ERR_PTR(-EEXIST) } ;
+ int err;

mutex_lock_nested(&nd->path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
/*
@@ -1953,11 +1959,13 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
/*
* Do the final lookup.
*/
- dentry = lookup_hash(nd);
- if (IS_ERR(dentry))
+ err = lookup_hash(nd, &nd->last, &path);
+ if (err) {
+ path.dentry = ERR_PTR(err);
goto fail;
+ }

- if (dentry->d_inode)
+ if (path.dentry->d_inode)
goto eexist;
/*
* Special case - lookup gave negative, but... we had foo/bar/
@@ -1966,15 +1974,17 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
* been asking for (non-existent) directory. -ENOENT for you.
*/
if (unlikely(!is_dir && nd->last.name[nd->last.len])) {
- dput(dentry);
- dentry = ERR_PTR(-ENOENT);
+ path_put_conditional(&path, nd);
+ path.dentry = ERR_PTR(-ENOENT);
}
- return dentry;
+ if (nd->path.mnt != path.mnt)
+ mntput(path.mnt);
+ return path.dentry;
eexist:
- dput(dentry);
- dentry = ERR_PTR(-EEXIST);
+ path_put_conditional(&path, nd);
+ path.dentry = ERR_PTR(-EEXIST);
fail:
- return dentry;
+ return path.dentry;
}
EXPORT_SYMBOL_GPL(lookup_create);

@@ -2211,7 +2221,7 @@ static long do_rmdir(int dfd, const char __user *pathname)
{
int error = 0;
char * name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;

error = user_path_parent(dfd, pathname, &nd, &name);
@@ -2233,21 +2243,20 @@ static long do_rmdir(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (IS_ERR(dentry))
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (error)
goto exit2;
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit3;
- error = security_path_rmdir(&nd.path, dentry);
+ error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
- error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
+ error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
exit3:
- dput(dentry);
+ path_put_conditional(&path, &nd);
exit2:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
exit1:
@@ -2302,7 +2311,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
{
int error;
char *name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;
struct inode *inode = NULL;

@@ -2317,26 +2326,25 @@ static long do_unlinkat(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (!error) {
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
- inode = dentry->d_inode;
+ inode = path.dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
- error = security_path_unlink(&nd.path, dentry);
+ error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
- error = vfs_unlink(nd.path.dentry->d_inode, dentry);
+ error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
exit2:
- dput(dentry);
+ path_put_conditional(&path, &nd);
}
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
if (inode)
@@ -2347,8 +2355,8 @@ exit1:
return error;

slashes:
- error = !dentry->d_inode ? -ENOENT :
- S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+ error = !path.dentry->d_inode ? -ENOENT :
+ S_ISDIR(path.dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
goto exit2;
}

@@ -2688,7 +2696,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
{
struct dentry *old_dir, *new_dir;
- struct dentry *old_dentry, *new_dentry;
+ struct path old, new;
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -2722,16 +2730,15 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,

trap = lock_rename(new_dir, old_dir);

- old_dentry = lookup_hash(&oldnd);
- error = PTR_ERR(old_dentry);
- if (IS_ERR(old_dentry))
+ error = lookup_hash(&oldnd, &oldnd.last, &old);
+ if (error)
goto exit3;
/* source must exist */
error = -ENOENT;
- if (!old_dentry->d_inode)
+ if (!old.dentry->d_inode)
goto exit4;
/* unless the source is a directory trailing slashes give -ENOTDIR */
- if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
+ if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
goto exit4;
@@ -2740,32 +2747,31 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
}
/* source should not be ancestor of target */
error = -EINVAL;
- if (old_dentry == trap)
+ if (old.dentry == trap)
goto exit4;
- new_dentry = lookup_hash(&newnd);
- error = PTR_ERR(new_dentry);
- if (IS_ERR(new_dentry))
+ error = lookup_hash(&newnd, &newnd.last, &new);
+ if (error)
goto exit4;
/* target should not be an ancestor of source */
error = -ENOTEMPTY;
- if (new_dentry == trap)
+ if (new.dentry == trap)
goto exit5;

error = mnt_want_write(oldnd.path.mnt);
if (error)
goto exit5;
- error = security_path_rename(&oldnd.path, old_dentry,
- &newnd.path, new_dentry);
+ error = security_path_rename(&oldnd.path, old.dentry,
+ &newnd.path, new.dentry);
if (error)
goto exit6;
- error = vfs_rename(old_dir->d_inode, old_dentry,
- new_dir->d_inode, new_dentry);
+ error = vfs_rename(old_dir->d_inode, old.dentry,
+ new_dir->d_inode, new.dentry);
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
- dput(new_dentry);
+ path_put_conditional(&new, &newnd);
exit4:
- dput(old_dentry);
+ path_put_conditional(&old, &oldnd);
exit3:
unlock_rename(new_dir, old_dir);
exit2:
--
1.6.3.3
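
The calling-convention change in this patch can be sketched in user space as follows. ERR_PTR(), PTR_ERR() and IS_ERR() are re-created locally here in the spirit of the kernel helpers; the structs are stripped-down stand-ins, and old_style_lookup(), new_style_lookup() and TOY_NAME_MAX are hypothetical names that exist only in this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO	4095
#define TOY_NAME_MAX	8	/* arbitrary limit to provoke an error */

/* Local re-creations of the kernel's error-pointer helpers. */
static void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Stand-ins for the kernel structures, reduced to what the sketch needs. */
struct dentry { const char *name; };
struct vfsmount { int id; };
struct path { struct vfsmount *mnt; struct dentry *dentry; };

static struct dentry toy_dentry;

/* Old convention: return a dentry, or an errno encoded in the pointer. */
static struct dentry *old_style_lookup(const char *name)
{
	if (strlen(name) > TOY_NAME_MAX)
		return ERR_PTR(-ENAMETOOLONG);
	toy_dentry.name = name;
	return &toy_dentry;
}

/*
 * New convention: return an errno and fill in a caller-supplied path.
 * On failure both members are cleared, so callers never see an
 * error pointer stored in the path.
 */
static int new_style_lookup(struct vfsmount *mnt, const char *name,
			    struct path *path)
{
	int err = 0;

	path->mnt = mnt;
	path->dentry = old_style_lookup(name);
	if (IS_ERR(path->dentry)) {
		err = (int)PTR_ERR(path->dentry);
		path->dentry = NULL;
		path->mnt = NULL;
	}
	return err;
}
```

Returning the mount together with the dentry matters for the rest of the series, presumably because a union lookup can end up on a different vfsmount than the one the caller started from.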

2009-10-21 19:31:17

by Valerie Aurora

Subject: [PATCH 04/41] VFS: Remove unnecessary micro-optimization in cached_lookup()

From: Jan Blunck <[email protected]>

d_lookup() takes rename_lock, which is a seqlock. Taking it on the read
side is so cheap that it's not worth calling the lockless __d_lookup()
first from cached_lookup(). Rename cached_lookup() to cache_lookup()
while we're there.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 13 ++++---------
1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e334f25..9c9ecfa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -404,15 +404,10 @@ do_revalidate(struct dentry *dentry, struct nameidata *nd)
* Internal lookup() using the new generic dcache.
* SMP-safe
*/
-static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
+ struct nameidata *nd)
{
- struct dentry * dentry = __d_lookup(parent, name);
-
- /* lockess __d_lookup may fail due to concurrent d_move()
- * in some unrelated directory, so try with d_lookup
- */
- if (!dentry)
- dentry = d_lookup(parent, name);
+ struct dentry *dentry = d_lookup(parent, name);

if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
dentry = do_revalidate(dentry, nd);
@@ -1191,7 +1186,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
goto out;
}

- dentry = cached_lookup(base, name, nd);
+ dentry = cache_lookup(base, name, nd);
if (!dentry) {
struct dentry *new;

--
1.6.3.3
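
For readers unfamiliar with seqlocks, the following user-space sketch shows why the read side is cheap: a reader only loads a sequence counter before and after its reads and retries if the counter moved. This is a simplified model using C11 atomics, not the kernel implementation; all toy_* names are invented, and the writer-side spinlock that a real seqlock uses to serialize writers is left out.

```c
#include <stdatomic.h>

struct toy_seqlock {
	atomic_uint sequence;	/* even: stable, odd: write in progress */
};

/* Protected data; the writer keeps a == b at all times. */
struct toy_pair {
	atomic_int a, b;
};

static void toy_init(struct toy_seqlock *sl, struct toy_pair *p)
{
	atomic_init(&sl->sequence, 0);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

/* Reader entry: wait for an even count, then remember it. */
static unsigned toy_read_begin(struct toy_seqlock *sl)
{
	unsigned seq;

	while ((seq = atomic_load(&sl->sequence)) & 1)
		;	/* a writer is active, spin */
	return seq;
}

/* Reader exit: nonzero means a writer ran and the read must be redone. */
static int toy_read_retry(struct toy_seqlock *sl, unsigned start)
{
	return atomic_load(&sl->sequence) != start;
}

/* Writer: bump to odd, update, bump back to even. */
static void toy_write(struct toy_seqlock *sl, struct toy_pair *p, int v)
{
	atomic_fetch_add(&sl->sequence, 1);
	atomic_store(&p->a, v);
	atomic_store(&p->b, v);
	atomic_fetch_add(&sl->sequence, 1);
}

/* Reader: loop until a snapshot was taken with no writer in between. */
static int toy_read(struct toy_seqlock *sl, struct toy_pair *p)
{
	unsigned seq;
	int a, b;

	do {
		seq = toy_read_begin(sl);
		a = atomic_load(&p->a);
		b = atomic_load(&p->b);
	} while (toy_read_retry(sl, seq));
	(void)b;	/* equal to a once the loop exits */
	return a;
}
```

The reader never writes to shared state, which is what makes the uncontended read path inexpensive compared with taking a conventional lock.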

2009-10-21 19:30:55

by Valerie Aurora

Subject: [PATCH 05/41] VFS: Make real_lookup() return a struct path

From: Jan Blunck <[email protected]>

This patch changes real_lookup() to return an error code and fill in a
struct path supplied by the caller, instead of returning a dentry.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 82 +++++++++++++++++++++++++++++++++++++----------------------
1 files changed, 51 insertions(+), 31 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9c9ecfa..a338496 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -462,10 +462,11 @@ ok:
* make sure that nobody added the entry to the dcache in the meantime..
* SMP-safe
*/
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+static int real_lookup(struct nameidata *nd, struct qstr *name,
+ struct path *path)
{
- struct dentry * result;
- struct inode *dir = parent->d_inode;
+ struct inode *dir = nd->path.dentry->d_inode;
+ int res = 0;

mutex_lock(&dir->i_mutex);
/*
@@ -482,27 +483,36 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
*
* so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
*/
- result = d_lookup(parent, name);
- if (!result) {
+ path->dentry = d_lookup(nd->path.dentry, name);
+ path->mnt = nd->path.mnt;
+ if (!path->dentry) {
struct dentry *dentry;

/* Don't create child dentry for a dead directory. */
- result = ERR_PTR(-ENOENT);
- if (IS_DEADDIR(dir))
+ if (IS_DEADDIR(dir)) {
+ res = -ENOENT;
goto out_unlock;
+ }

- dentry = d_alloc(parent, name);
- result = ERR_PTR(-ENOMEM);
+ dentry = d_alloc(nd->path.dentry, name);
if (dentry) {
- result = dir->i_op->lookup(dir, dentry, nd);
- if (result)
+ path->dentry = dir->i_op->lookup(dir, dentry, nd);
+ if (path->dentry) {
dput(dentry);
- else
- result = dentry;
+ if (IS_ERR(path->dentry)) {
+ res = PTR_ERR(path->dentry);
+ path->dentry = NULL;
+ path->mnt = NULL;
+ }
+ } else
+ path->dentry = dentry;
+ } else {
+ res = -ENOMEM;
+ path->mnt = NULL;
}
out_unlock:
mutex_unlock(&dir->i_mutex);
- return result;
+ return res;
}

/*
@@ -510,12 +520,20 @@ out_unlock:
* we waited on the semaphore. Need to revalidate.
*/
mutex_unlock(&dir->i_mutex);
- if (result->d_op && result->d_op->d_revalidate) {
- result = do_revalidate(result, nd);
- if (!result)
- result = ERR_PTR(-ENOENT);
+ if (path->dentry->d_op && path->dentry->d_op->d_revalidate) {
+ path->dentry = do_revalidate(path->dentry, nd);
+ if (!path->dentry) {
+ res = -ENOENT;
+ path->mnt = NULL;
+ }
+ if (IS_ERR(path->dentry)) {
+ res = PTR_ERR(path->dentry);
+ path->dentry = NULL;
+ path->mnt = NULL;
+ }
}
- return result;
+
+ return res;
}

/*
@@ -785,35 +803,37 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
static int do_lookup(struct nameidata *nd, struct qstr *name,
struct path *path)
{
- struct vfsmount *mnt = nd->path.mnt;
- struct dentry *dentry = __d_lookup(nd->path.dentry, name);
+ int err;

- if (!dentry)
+ path->dentry = __d_lookup(nd->path.dentry, name);
+ path->mnt = nd->path.mnt;
+ if (!path->dentry)
goto need_lookup;
- if (dentry->d_op && dentry->d_op->d_revalidate)
+ if (path->dentry->d_op && path->dentry->d_op->d_revalidate)
goto need_revalidate;
+
done:
- path->mnt = mnt;
- path->dentry = dentry;
__follow_mount(path);
return 0;

need_lookup:
- dentry = real_lookup(nd->path.dentry, name, nd);
- if (IS_ERR(dentry))
+ err = real_lookup(nd, name, path);
+ if (err)
goto fail;
goto done;

need_revalidate:
- dentry = do_revalidate(dentry, nd);
- if (!dentry)
+ path->dentry = do_revalidate(path->dentry, nd);
+ if (!path->dentry)
goto need_lookup;
- if (IS_ERR(dentry))
+ if (IS_ERR(path->dentry)) {
+ err = PTR_ERR(path->dentry);
goto fail;
+ }
goto done;

fail:
- return PTR_ERR(dentry);
+ return err;
}

/*
--
1.6.3.3

2009-10-21 19:20:29

by Valerie Aurora

Subject: [PATCH 06/41] VFS: Introduce dput() variant that maintains a kill-list

From: Jan Blunck <[email protected]>

This patch introduces a new variant of dput(). This becomes necessary to
prevent a recursive call to dput() from the union mount code.

void __dput(struct dentry *dentry, struct list_head *list, int greedy);
struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
int greedy);

__dput() works mostly like the original dput() did. The main difference is
that if the greedy argument is zero, it will put the parent on a special
list instead of trying to get rid of it directly.

Therefore the union mount code can safely call __dput() when it wants to get
rid of underlying dentry references during a dput(). After calling __dput()
or __d_kill(), the caller must make sure that __d_kill_final() is called on
all dentries on the kill list. __d_kill_final() does the actual
dentry_iput() work and also drops the reference on the parent.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 105 insertions(+), 10 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 38bf982..3415e9e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -157,14 +157,19 @@ static void dentry_lru_del_init(struct dentry *dentry)
}

/**
- * d_kill - kill dentry and return parent
+ * __d_kill - kill dentry and return parent
* @dentry: dentry to kill
+ * @list: kill list
+ * @greedy: return parent instead of putting it on the kill list
*
* The dentry must already be unhashed and removed from the LRU.
*
- * If this is the root of the dentry tree, return NULL.
+ * If this is the root of the dentry tree, return NULL. If greedy is zero, we
+ * put the parent of this dentry on the kill list instead. The callers must
+ * make sure that __d_kill_final() is called on all dentries on the kill list.
*/
-static struct dentry *d_kill(struct dentry *dentry)
+static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
+ int greedy)
__releases(dentry->d_lock)
__releases(dcache_lock)
{
@@ -172,6 +177,20 @@ static struct dentry *d_kill(struct dentry *dentry)

list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
+
+ /*
+ * If we are not greedy we just put this on a list for later processing
+ * (follow up to parent, releasing of inode and freeing dentry memory).
+ */
+ if (!greedy) {
+ list_del_init(&dentry->d_alias);
+ /* at this point nobody can reach this dentry */
+ list_add(&dentry->d_lru, list);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ return NULL;
+ }
+
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
if (IS_ROOT(dentry))
@@ -182,6 +201,54 @@ static struct dentry *d_kill(struct dentry *dentry)
return parent;
}

+void __dput(struct dentry *, struct list_head *, int);
+
+static void __d_kill_final(struct dentry *dentry, struct list_head *list)
+{
+ struct dentry *parent;
+ struct inode *inode = dentry->d_inode;
+
+ if (inode) {
+ dentry->d_inode = NULL;
+ if (!inode->i_nlink)
+ fsnotify_inoderemove(inode);
+ if (dentry->d_op && dentry->d_op->d_iput)
+ dentry->d_op->d_iput(dentry, inode);
+ else
+ iput(inode);
+ }
+
+ if (IS_ROOT(dentry))
+ parent = NULL;
+ else
+ parent = dentry->d_parent;
+ d_free(dentry);
+ __dput(parent, list, 1);
+}
+
+/**
+ * d_kill - kill dentry and return parent
+ * @dentry: dentry to kill
+ *
+ * The dentry must already be unhashed and removed from the LRU.
+ *
+ * If this is the root of the dentry tree, return NULL.
+ */
+static struct dentry *d_kill(struct dentry *dentry)
+{
+ LIST_HEAD(mortuary);
+ struct dentry *parent;
+
+ parent = __d_kill(dentry, &mortuary, 1);
+ while (!list_empty(&mortuary)) {
+ dentry = list_entry(mortuary.next, struct dentry, d_lru);
+ list_del(&dentry->d_lru);
+ __d_kill_final(dentry, &mortuary);
+ }
+
+ return parent;
+}
+
/*
* This is dput
*
@@ -199,19 +266,24 @@ static struct dentry *d_kill(struct dentry *dentry)
* Real recursion would eat up our stack space.
*/

-/*
- * dput - release a dentry
- * @dentry: dentry to release
+/**
+ * __dput - release a dentry
+ * @dentry: dentry to release
+ * @list: kill list argument for __d_kill()
+ * @greedy: greedy argument for __d_kill()
*
* Release a dentry. This will drop the usage count and if appropriate
* call the dentry unlink method as well as removing it from the queues and
* releasing its resources. If the parent dentries were scheduled for release
- * they too may now get deleted.
+ * they too may now get deleted if @greedy is not zero. Otherwise parent is
+ * added to the kill list. The callers must make sure that __d_kill_final() is
+ * called on all dentries on the kill list.
+ *
+ * You probably want to use dput() instead.
*
* no dcache lock, please.
*/
-
-void dput(struct dentry *dentry)
+void __dput(struct dentry *dentry, struct list_head *list, int greedy)
{
if (!dentry)
return;
@@ -252,12 +324,35 @@ unhash_it:
kill_it:
/* if dentry was on the d_lru list delete it from there */
dentry_lru_del(dentry);
- dentry = d_kill(dentry);
+ dentry = __d_kill(dentry, list, greedy);
if (dentry)
goto repeat;
}

/**
+ * dput - release a dentry
+ * @dentry: dentry to release
+ *
+ * Release a dentry. This will drop the usage count and if appropriate
+ * call the dentry unlink method as well as removing it from the queues and
+ * releasing its resources. If the parent dentries were scheduled for release
+ * they too may now get deleted.
+ *
+ * no dcache lock, please.
+ */
+void dput(struct dentry *dentry)
+{
+ LIST_HEAD(mortuary);
+
+ __dput(dentry, &mortuary, 1);
+ while (!list_empty(&mortuary)) {
+ dentry = list_entry(mortuary.next, struct dentry, d_lru);
+ list_del(&dentry->d_lru);
+ __d_kill_final(dentry, &mortuary);
+ }
+}
+
+/**
* d_invalidate - invalidate a dentry
* @dentry: dentry to invalidate
*
--
1.6.3.3
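
The kill-list idea can be modelled in a few lines of user-space C. This is a toy, not the patch's control flow: it always defers (there is no greedy flag), it links victims through a dedicated kill_next pointer where the patch reuses d_lru, and the node and kill_list names are invented for the sketch.

```c
#include <stddef.h>

struct node {
	struct node *parent;	/* each child holds one reference on its parent */
	struct node *kill_next;	/* linkage while on the kill list */
	int refcount;
	int dead;		/* set once the node has been torn down */
};

struct kill_list {
	struct node *head;
};

/*
 * Drop one reference. If it was the last one, queue the node on the
 * kill list instead of tearing it down (and its parents) right here.
 */
static void node_put_deferred(struct node *n, struct kill_list *list)
{
	if (!n || --n->refcount > 0)
		return;
	n->kill_next = list->head;
	list->head = n;
}

/*
 * Final teardown of a queued node: release it, then drop its reference
 * on the parent, which may put the parent on the same list.
 */
static void node_kill_final(struct node *n, struct kill_list *list)
{
	struct node *parent = n->parent;

	n->dead = 1;
	n->parent = NULL;
	node_put_deferred(parent, list);
}

/* Public entry point: put a node and drain the list iteratively. */
static void node_put(struct node *n)
{
	struct kill_list mortuary = { NULL };

	node_put_deferred(n, &mortuary);
	while (mortuary.head) {
		struct node *victim = mortuary.head;

		mortuary.head = victim->kill_next;
		node_kill_final(victim, &mortuary);
	}
}
```

Because teardown only ever appends to the list, a put issued from inside the teardown path cannot recurse; the outer loop picks up whatever was queued.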

2009-10-21 19:20:28

by Valerie Aurora

Subject: [PATCH 07/41] VFS: Add read-only users count to superblock

While we can check if a file system is currently read-only, we can't
guarantee that it will stay read-only. The file system can be
remounted read-write at any time; it's also conceivable that a file
system can be mounted a second time and converted to read-write if the
underlying fs allows it. This is a problem for union mounts, which
require the underlying file system to be read-only. Add a read-only
users count, and refuse both read-write remounts and new read-write
mounts of the file system while it has any read-only users.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/super.c | 14 ++++++++++++++
include/linux/fs.h | 5 +++++
2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 2761d3e..c8140ac 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -553,6 +553,15 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
}
remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);

+ /* If we are remounting read/write, make sure that none of the
+ users require read-only for correct operation (such as
+ union mounts). */
+ if (remount_rw && sb->s_readonly_users) {
+ printk(KERN_INFO "%s: In use by %d read-only user(s)\n",
+ sb->s_id, sb->s_readonly_users);
+ return -EROFS;
+ }
+
if (sb->s_op->remount_fs) {
retval = sb->s_op->remount_fs(sb, &flags, data);
if (retval)
@@ -889,6 +898,11 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
if (error)
goto out_sb;

+ error = -EROFS;
+ if (!(flags & MS_RDONLY) &&
+ (mnt->mnt_sb->s_readonly_users))
+ goto out_sb;
+
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
up_write(&mnt->mnt_sb->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..5fb7343 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1379,6 +1379,11 @@ struct super_block {
* generic_show_options()
*/
char *s_options;
+
+ /*
+ * Users who require read-only access - e.g., union mounts
+ */
+ int s_readonly_users;
};

extern struct timespec current_fs_time(struct super_block *sb);
--
1.6.3.3
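
The remount check added above reduces to a small state machine, modelled here in user space. The struct and the toy_* helpers are illustrative stand-ins only; in particular, the add/drop helpers are hypothetical, since this patch only adds the counter and the checks, and TOY_MS_RDONLY is a local constant rather than the kernel's MS_RDONLY.

```c
#include <assert.h>
#include <errno.h>

#define TOY_MS_RDONLY	1	/* local flag for this sketch */

struct toy_super_block {
	int s_flags;
	int s_readonly_users;	/* users that need the fs to stay read-only */
};

/* Hypothetical helper: register a user that depends on read-only. */
static void toy_add_readonly_user(struct toy_super_block *sb)
{
	assert(sb->s_flags & TOY_MS_RDONLY);
	sb->s_readonly_users++;
}

/* Hypothetical helper: the matching release. */
static void toy_drop_readonly_user(struct toy_super_block *sb)
{
	assert(sb->s_readonly_users > 0);
	sb->s_readonly_users--;
}

/*
 * Model of the check in do_remount_sb(): a read-only to read-write
 * transition is refused with -EROFS while read-only users exist.
 */
static int toy_remount(struct toy_super_block *sb, int flags)
{
	int remount_rw = !(flags & TOY_MS_RDONLY) &&
			 (sb->s_flags & TOY_MS_RDONLY);

	if (remount_rw && sb->s_readonly_users)
		return -EROFS;
	sb->s_flags = (sb->s_flags & ~TOY_MS_RDONLY) |
		      (flags & TOY_MS_RDONLY);
	return 0;
}
```

A union mount would be expected to hold such a reference on its lower layer for as long as the union exists, which is what later patches in the series appear to do.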

2009-10-21 19:29:52

by Valerie Aurora

Subject: [PATCH 08/41] Don't replace nameidata path when following links

From: Jan Blunck <[email protected]>

For autofs4, commit 051d381259eb57d6074d02a6ba6e90e744f1a29f introduced
code that replaces the path embedded in the nameidata with the path of
the link itself. This was done to give autofs4_follow_link() access to
the struct vfsmount. Instead, autofs4 should remember the struct
vfsmount when it is mounted.

Signed-off-by: Jan Blunck <[email protected]>
---
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 ++++++++++-
fs/autofs4/root.c | 6 ++++++
fs/namei.c | 7 ++-----
4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 8f7cdde..db2bfce 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -130,6 +130,7 @@ struct autofs_sb_info {
int reghost_enabled;
int needs_reghost;
struct super_block *sb;
+ struct vfsmount *mnt;
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 9722e4b..5e0dcd7 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -17,7 +17,16 @@
static int autofs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
- return get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
+ struct autofs_sb_info *sbi;
+ int ret;
+
+ ret = get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
+ if (ret)
+ return ret;
+
+ sbi = autofs4_sbi(mnt->mnt_sb);
+ sbi->mnt = mnt;
+ return 0;
}

static struct file_system_type autofs_fs_type = {
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index b96a3c5..cb991b8 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -179,6 +179,12 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
DPRINTK("dentry=%p %.*s oz_mode=%d nd->flags=%d",
dentry, dentry->d_name.len, dentry->d_name.name, oz_mode,
nd->flags);
+
+ dput(nd->path.dentry);
+ mntput(nd->path.mnt);
+ nd->path.mnt = mntget(sbi->mnt);
+ nd->path.dentry = dget(dentry);
+
/*
* For an expire of a covered direct or offset mount we need
* to break out of follow_down() at the autofs mount trigger
diff --git a/fs/namei.c b/fs/namei.c
index a338496..46cf1cb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -636,11 +636,8 @@ static __always_inline int __do_follow_link(struct path *path, struct nameidata
touch_atime(path->mnt, dentry);
nd_set_link(nd, NULL);

- if (path->mnt != nd->path.mnt) {
- path_to_nameidata(path, nd);
- dget(dentry);
- }
- mntget(path->mnt);
+ if (path->mnt == nd->path.mnt)
+ mntget(nd->path.mnt);
cookie = dentry->d_inode->i_op->follow_link(dentry, nd);
error = PTR_ERR(cookie);
if (!IS_ERR(cookie)) {
--
1.6.3.3

2009-10-21 19:28:20

by Valerie Aurora

Subject: [PATCH 09/41] whiteout: Don't return information about whiteouts to userspace

From: Jan Blunck <[email protected]>

Userspace isn't ready to handle another file type. Therefore this
patch makes readdir() and related code skip over any whiteout directory
entries they find.
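
The filtering pattern used in each callback can be sketched in user space like this. The toy_* names and the fixed-size buffer are invented for illustration; the two type constants carry the same numeric values as DT_REG and DT_WHT in <dirent.h>.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_DT_REG	8	/* same value as DT_REG */
#define TOY_DT_WHT	14	/* same value as DT_WHT */
#define TOY_MAX_NAMES	8

/* Stand-in for the user buffer a filldir callback fills. */
struct toy_buf {
	const char *names[TOY_MAX_NAMES];
	size_t count;
};

/*
 * Toy filldir-style callback: return 0 to keep iterating, negative to
 * stop. Whiteouts are skipped by returning 0 without recording
 * anything, so the directory walk continues as if they were absent.
 */
static int toy_filldir(void *ctx, const char *name, unsigned d_type)
{
	struct toy_buf *buf = ctx;

	if (d_type == TOY_DT_WHT)
		return 0;
	if (buf->count == TOY_MAX_NAMES)
		return -EINVAL;
	buf->names[buf->count++] = name;
	return 0;
}
```

Returning 0 rather than an error is the important part: an error would end the directory walk early, while 0 simply moves on to the next entry.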

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/compat.c | 9 +++++++++
fs/nfsd/nfs3xdr.c | 5 +++++
fs/nfsd/nfs4xdr.c | 2 +-
fs/nfsd/nfsxdr.c | 4 ++++
fs/readdir.c | 9 +++++++++
5 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 6d6f98f..43f6102 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -847,6 +847,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
struct compat_old_linux_dirent __user *dirent;
compat_ulong_t d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -918,6 +921,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
compat_ulong_t d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -1007,6 +1013,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
u64 off;

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 01d4ec1..59576d0 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -884,6 +884,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
int elen; /* estimated entry length in words */
int num_entry_words = 0; /* actual number of words */

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset) {
u64 offset64 = offset;

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 2dcc7fe..8c25012 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2263,7 +2263,7 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
__be32 nfserr = nfserr_toosmall;

/* In nfsv4, "." and ".." never make it onto the wire.. */
- if (name && isdotent(name, namlen)) {
+ if (d_type == DT_WHT || (name && isdotent(name, namlen))) {
cd->common.err = nfs_ok;
return 0;
}
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index afd08e2..a7d622c 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -513,6 +513,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
namlen, name, offset, ino);
*/

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
if (offset > ~((u32) 0)) {
cd->common.err = nfserr_fbig;
return -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..3a48491 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
struct old_linux_dirent __user * dirent;
unsigned long d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
unsigned long d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
--
1.6.3.3

2009-10-21 19:20:36

by Valerie Aurora

Subject: [PATCH 10/41] whiteout: Add vfs_whiteout() and whiteout inode operation

From: Jan Blunck <[email protected]>

Simply white-out a given directory entry. This functionality is usually used
in the sense of unlink. Therefore the given dentry can still be in use and
contain an in-use inode. The filesystem's inode operation has to do what
unlink or rmdir would in that case. Since the dentry might still be in use,
we have to provide a fresh unhashed dentry that is used as the whiteout
dentry instead. The given dentry is dropped and the whiteout dentry is
rehashed instead.
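
As a rough userspace model of the visible effect (not of the dentry handling described above), a whiteout in the upper layer hides a same-named entry in the lower layer. Everything in this sketch is hypothetical: `struct toy_layer`, `toy_whiteout()` and `toy_lookup()` are illustration-only names, and the fixed-size array stands in for a real directory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_FILE, TOY_WHITEOUT };

struct toy_entry {
    const char *name;
    enum toy_kind kind;
};

#define TOY_MAX 8

/* One layer of the union: a flat list of named entries. */
struct toy_layer {
    struct toy_entry entries[TOY_MAX];
    size_t count;
};

static struct toy_entry *toy_find(struct toy_layer *layer, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < layer->count; i++)
        if (strcmp(layer->entries[i].name, name) == 0)
            return &layer->entries[i];
    return NULL;
}

/* White out `name` in the upper layer: convert an existing entry or add one. */
int toy_whiteout(struct toy_layer *upper, const char *name)
{
    struct toy_entry *e = toy_find(upper, name);

    if (e) {
        e->kind = TOY_WHITEOUT;
        return 0;
    }
    if (upper->count == TOY_MAX)
        return -ENOSPC;
    upper->entries[upper->count].name = name;
    upper->entries[upper->count].kind = TOY_WHITEOUT;
    upper->count++;
    return 0;
}

/* Returns 0 if `name` is visible in the union, -ENOENT otherwise. */
int toy_lookup(struct toy_layer *upper, struct toy_layer *lower,
               const char *name)
{
    struct toy_entry *e = toy_find(upper, name);

    if (e)
        return e->kind == TOY_WHITEOUT ? -ENOENT : 0;
    return toy_find(lower, name) ? 0 : -ENOENT;
}
```

The lower layer is never modified, which is the point of the whiteout: the deletion is recorded entirely in the writable layer.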

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 4 +-
fs/namei.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 6 +++
include/linux/fs.h | 3 +
4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3415e9e..0fcae4b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1076,8 +1076,10 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
/* the caller must hold dcache_lock */
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
- if (inode)
+ if (inode) {
+ dentry->d_flags &= ~DCACHE_WHITEOUT;
list_add(&dentry->d_alias, &inode->i_dentry);
+ }
dentry->d_inode = inode;
fsnotify_d_instantiate(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 46cf1cb..d2fc8c9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2169,6 +2169,110 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
return sys_mkdirat(AT_FDCWD, pathname, mode);
}

+
+/* Checks on the victim for whiteout */
+static inline int may_whiteout(struct inode *dir, struct dentry *victim,
+ int isdir)
+{
+ int err;
+
+ /* from may_create() */
+ if (IS_DEADDIR(dir))
+ return -ENOENT;
+ err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+ if (err)
+ return err;
+
+ /* from may_delete() */
+ if (IS_APPEND(dir))
+ return -EPERM;
+ if (!victim->d_inode)
+ return 0;
+ if (check_sticky(dir, victim->d_inode) ||
+ IS_APPEND(victim->d_inode) ||
+ IS_IMMUTABLE(victim->d_inode))
+ return -EPERM;
+ if (isdir) {
+ if (!S_ISDIR(victim->d_inode->i_mode))
+ return -ENOTDIR;
+ if (IS_ROOT(victim))
+ return -EBUSY;
+ } else if (S_ISDIR(victim->d_inode->i_mode))
+ return -EISDIR;
+ if (victim->d_flags & DCACHE_NFSFS_RENAMED)
+ return -EBUSY;
+ return 0;
+}
+
+/**
+ * vfs_whiteout: creates a white-out for the given directory entry
+ * @dir: parent inode
+ * @dentry: directory entry to white-out
+ *
+ * Simply white-out a given directory entry. This functionality is usually used
+ * in the sense of unlink. Therefore the given dentry can still be in-use and
+ * contains an in-use inode. The filesystem has to do what unlink or rmdir
+ * would in that case. Since the dentry still might be in-use we have to
+ * provide a fresh unhashed dentry that whiteout can fill the new inode into.
+ * In that case the given dentry is dropped and the fresh dentry containing the
+ * whiteout is rehashed instead. If the given dentry is unused, the whiteout
+ * inode is instantiated into it instead.
+ *
+ * After this returns with success, don't make any assumptions about the inode.
+ * Just dput() the dentry.
+ */
+int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
+{
+ int err;
+ struct inode *old_inode = dentry->d_inode;
+ struct dentry *parent, *whiteout;
+
+ err = may_whiteout(dir, dentry, isdir);
+ if (err)
+ return err;
+
+ BUG_ON(dentry->d_parent->d_inode != dir);
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ if (old_inode) {
+ vfs_dq_init(dir);
+
+ mutex_lock(&old_inode->i_mutex);
+ if (isdir)
+ dentry_unhash(dentry);
+ if (d_mountpoint(dentry))
+ err = -EBUSY;
+ else {
+ if (isdir)
+ err = security_inode_rmdir(dir, dentry);
+ else
+ err = security_inode_unlink(dir, dentry);
+ }
+ }
+
+ parent = dget_parent(dentry);
+ whiteout = d_alloc_name(parent, dentry->d_name.name);
+
+ if (!err)
+ err = dir->i_op->whiteout(dir, dentry, whiteout);
+
+ if (old_inode) {
+ mutex_unlock(&old_inode->i_mutex);
+ if (!err) {
+ fsnotify_link_count(old_inode);
+ d_delete(dentry);
+ }
+ if (isdir)
+ dput(dentry);
+ }
+
+ dput(whiteout);
+ dput(parent);
+ return err;
+}
+
/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 30b93b2..7648b49 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -183,6 +183,7 @@ d_iput: no no no yes
#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched by inotify */

#define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
+#define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */

#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */

@@ -358,6 +359,11 @@ static inline int d_unlinked(struct dentry *dentry)
return d_unhashed(dentry) && !IS_ROOT(dentry);
}

+static inline int d_is_whiteout(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_WHITEOUT);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5fb7343..04a9870 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -205,6 +205,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_WHITEOUT (1<<26) /* fs does support white-out filetype */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1422,6 +1423,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_whiteout(struct inode *, struct dentry *, int);

/*
* VFS dentry helper functions.
@@ -1526,6 +1528,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3

2009-10-21 19:20:32

by Valerie Aurora

Subject: [PATCH 11/41] whiteout: Set S_OPAQUE inode flag when creating directories

From: Jan Blunck <[email protected]>

In a union directory, we don't want directories on lower layers of the
union to "show through". To prevent the contents of underlying directories
from magically showing up after a mkdir(), we set the S_OPAQUE flag when a
directory is created where a whiteout existed before.
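
The ordering this relies on (sample the whiteout state before the create, set the flag after it succeeds) can be modeled in userspace along these lines. This is a sketch under simplifying assumptions: `struct sketch_dentry`, `sketch_mkdir()` and `sketch_visible_entries()` are hypothetical, and the listing helper assumes no name appears in both layers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_state { SKETCH_NEGATIVE, SKETCH_WHITEOUT, SKETCH_DIR };

/* Hypothetical stand-in for an upper-layer dentry and its inode flag. */
struct sketch_dentry {
    enum sketch_state state;
    bool opaque;
};

/*
 * Remember whether the name was a whiteout before creating the
 * directory, because creation replaces the whiteout state.
 */
int sketch_mkdir(struct sketch_dentry *d)
{
    bool was_whiteout;

    assert(d != NULL);
    if (d->state == SKETCH_DIR)
        return -EEXIST;

    was_whiteout = (d->state == SKETCH_WHITEOUT);
    d->state = SKETCH_DIR;
    d->opaque = was_whiteout;
    return 0;
}

/* An opaque directory shows only its own entries; otherwise layers merge. */
size_t sketch_visible_entries(const struct sketch_dentry *upper_dir,
                              size_t upper_n, size_t lower_n)
{
    if (upper_dir->opaque)
        return upper_n;
    return upper_n + lower_n;
}
```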

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 12 +++++++++++-
include/linux/fs.h | 3 +++
2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d2fc8c9..5da1635 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2108,6 +2108,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int error = may_create(dir, dentry);
+ int opaque = 0;

if (error)
return error;
@@ -2121,9 +2122,18 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
return error;

vfs_dq_init(dir);
+
+ if (d_is_whiteout(dentry))
+ opaque = 1;
+
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
+ if (!error) {
fsnotify_mkdir(dir, dentry);
+ if (opaque) {
+ dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(dentry->d_inode);
+ }
+ }
return error;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 04a9870..b741e50 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -232,6 +232,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_OPAQUE 1024 /* Directory is opaque */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -267,6 +268,8 @@ struct inodes_stat_t {
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)

+#define IS_OPAQUE(inode) ((inode)->i_flags & S_OPAQUE)
+
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */

--
1.6.3.3

2009-10-21 19:20:40

by Valerie Aurora

Subject: [PATCH 12/41] union-mount: Allow removal of a directory

From: Jan Blunck <[email protected]>

do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.

XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
".."). Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.

Fixes:
- Add ->is_directory_empty() op
- Add is_directory_empty flag to dentry (ugly dcache populate)
- Ask underlying fs to remove it and look for an error return
- (your idea here)
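
For comparison, the same "logically empty" test can be written in userspace with opendir()/readdir(). This is only a sketch of the rule (ignore ".", ".." and whiteouts; anything else means not empty): `dir_is_logically_empty()` is a hypothetical name, it relies on the non-POSIX `d_type` field, and with this patch set applied userspace would not normally be handed `DT_WHT` entries at all, so that check merely mirrors the kernel callback.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>

#ifndef DT_WHT
#define DT_WHT 14 /* BSD whiteout type; fallback if the libc hides it */
#endif

/* Returns 1 if logically empty, 0 if not, -1 on error (errno is set). */
int dir_is_logically_empty(const char *path)
{
    DIR *dir;
    struct dirent *de;
    int result = 1;
    int saved_errno;

    assert(path != NULL);

    dir = opendir(path);
    if (!dir)
        return -1;

    for (;;) {
        errno = 0; /* distinguish end-of-directory from a read error */
        de = readdir(dir);
        if (de == NULL) {
            if (errno != 0)
                result = -1;
            break;
        }
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (de->d_type == DT_WHT)
            continue;
        result = 0; /* found a real entry */
        break;
    }

    saved_errno = errno;
    closedir(dir);
    errno = saved_errno;
    return result;
}
```

Unlike the in-kernel version, this variant stops at the first real entry instead of walking the whole directory, and it reports an open failure as an error rather than as "not empty".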

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5da1635..9a62c75 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2284,6 +2284,91 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
}

/*
+ * This is abusing readdir to check if a union directory is logically empty.
+ * Al Viro barfed when he saw this, but Val said: "Well, at this point I'm
+ * aiming for working, pretty can come later"
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ int *is_empty = (int *)__buf;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ (*is_empty) = 0;
+ return 0;
+}
+
+static int directory_is_empty(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct file *file;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+ /* references for the file pointer */
+ dget(dentry);
+ mntget(mnt);
+
+ file = dentry_open(dentry, mnt, O_RDONLY, current_cred());
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
+{
+ struct path safe = { .dentry = dget(nd->path.dentry),
+ .mnt = mntget(nd->path.mnt) };
+ struct dentry *dentry = path->dentry;
+ int err;
+
+ err = may_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+ if (err)
+ goto out;
+
+ err = -ENOTEMPTY;
+ if (isdir && !directory_is_empty(path->dentry, path->mnt))
+ goto out;
+
+ if (nd->path.dentry != dentry->d_parent) {
+ dentry = __lookup_hash(&path->dentry->d_name, nd->path.dentry,
+ nd);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out;
+
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
+
+ err = vfs_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+
+out:
+ path_put(&safe);
+ return err;
+}
+
+/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning
--
1.6.3.3

2009-10-21 19:29:04

by Valerie Aurora

Subject: [PATCH 13/41] whiteout: tmpfs whiteout support

From: Jan Blunck <[email protected]>

Add support for whiteout dentries to tmpfs.

XXX - Not sure this is the right patch to put the code for supporting
whiteouts in d_genocide().

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 3 +-
mm/shmem.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 137 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0fcae4b..1fae1df 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2280,7 +2280,8 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (d_unhashed(dentry)||!dentry->d_inode)
+ if (d_unhashed(dentry)||(!dentry->d_inode &&
+ !d_is_whiteout(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..2faa14b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1794,6 +1794,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}

+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+ struct dentry *dentry;
+
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ /* This gives us a properly initialized negative dentry */
+ dentry = simple_lookup(dir, new_dentry, NULL);
+ if (dentry && IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ /*
+ * No ordinary (disk based) filesystem counts whiteouts as inodes;
+ * but each new link needs a new dentry, pinning lowmem, and
+ * tmpfs dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ if (old_dentry->d_inode) {
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ shmem_rmdir(dir, old_dentry);
+ else
+ shmem_unlink(dir, old_dentry);
+ }
+
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ /* Extra pinning count for the created dentry */
+ dget(new_dentry);
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode)
+{
+ if (d_is_whiteout(dentry)) {
+ /* Re-using an existing whiteout */
+ shmem_free_inode(dir->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ inode->i_mode |= S_OPAQUE;
+ } else {
+ /* New dentry */
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dget(dentry); /* Extra count - pin the dentry in core */
+ }
+ /* Will clear DCACHE_WHITEOUT flag */
+ d_instantiate(dentry, inode);
+
+}
/*
* File creation. Allocate an inode, and we're done..
*/
@@ -1823,10 +1893,10 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
if (S_ISDIR(mode))
inode->i_mode |= S_ISGID;
}
- dir->i_size += BOGO_DIRENT_SIZE;
+
+ shmem_d_instantiate(dir, dentry, inode);
+
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
}
@@ -1864,12 +1934,11 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
if (ret)
goto out;

- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
+
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
atomic_inc(&inode->i_count); /* New dentry reference */
- dget(dentry); /* Extra pinning count for the created dentry */
- d_instantiate(dentry, inode);
out:
return ret;
}
@@ -1878,21 +1947,61 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
- shmem_free_inode(inode->i_sb);
+ if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ shmem_free_inode(dir->i_sb);

+ if (inode) {
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ drop_nlink(inode);
+ }
dir->i_size -= BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
return 0;
}

+static void shmem_dir_unlink_whiteouts(struct inode *dir, struct dentry *dentry)
+{
+ if (!dentry->d_inode)
+ return;
+
+ /* Remove whiteouts from logically empty directory */
+ if (S_ISDIR(dentry->d_inode->i_mode) &&
+ dentry->d_inode->i_sb->s_flags & MS_WHITEOUT) {
+ struct dentry *child, *next;
+ LIST_HEAD(list);
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+ spin_lock(&child->d_lock);
+ if (d_is_whiteout(child)) {
+ __d_drop(child);
+ if (!list_empty(&child->d_lru)) {
+ list_del(&child->d_lru);
+ dentry_stat.nr_unused--;
+ }
+ list_add(&child->d_lru, &list);
+ }
+ spin_unlock(&child->d_lock);
+ }
+ spin_unlock(&dcache_lock);
+
+ list_for_each_entry_safe(child, next, &list, d_lru) {
+ spin_lock(&child->d_lock);
+ list_del_init(&child->d_lru);
+ spin_unlock(&child->d_lock);
+
+ shmem_unlink(dentry->d_inode, child);
+ }
+ }
+}
+
static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
if (!simple_empty(dentry))
return -ENOTEMPTY;

+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(dir, dentry);
drop_nlink(dentry->d_inode);
drop_nlink(dir);
return shmem_unlink(dir, dentry);
@@ -1901,7 +2010,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
/*
* The VFS layer already does all the dentry stuff for rename,
* we just have to decrement the usage count for the target if
- * it exists so that the VFS layer correctly free's it when it
+ * it exists so that the VFS layer correctly frees it when it
* gets overwritten.
*/
static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
@@ -1912,7 +2021,12 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
if (!simple_empty(new_dentry))
return -ENOTEMPTY;

+ if (d_is_whiteout(new_dentry))
+ shmem_unlink(new_dir, new_dentry);
+
if (new_dentry->d_inode) {
+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(new_dir, new_dentry);
(void) shmem_unlink(new_dir, new_dentry);
if (they_are_dirs)
drop_nlink(old_dir);
@@ -1977,12 +2091,12 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s
set_page_dirty(page);
page_cache_release(page);
}
+
+ shmem_d_instantiate(dir, dentry, inode);
+
if (dir->i_mode & S_ISGID)
inode->i_gid = dir->i_gid;
- dir->i_size += BOGO_DIRENT_SIZE;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry);
return 0;
}

@@ -2363,6 +2477,12 @@ static int shmem_fill_super(struct super_block *sb,
if (!root)
goto failed_iput;
sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+ if (!(sb->s_flags & MS_NOUSER))
+ sb->s_flags |= MS_WHITEOUT;
+#endif
+
return 0;

failed_iput:
@@ -2462,6 +2582,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.rmdir = shmem_rmdir,
.mknod = shmem_mknod,
.rename = shmem_rename,
+ .whiteout = shmem_whiteout,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3

2009-10-21 19:28:46

by Valerie Aurora

Subject: [PATCH 14/41] whiteout: Split of ext2_append_link() from ext2_add_link()

From: Jan Blunck <[email protected]>

ext2_append_entry(), the helper this patch splits out of ext2_add_link(),
is later used to find or append a directory entry for a whiteout.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 70 ++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 6cde970..cb8ceff 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
}

/*
- * Parent is locked.
+ * Find or append a given dentry to the parent directory
*/
-int ext2_add_link (struct dentry *dentry, struct inode *inode)
+static ext2_dirent * ext2_append_entry(struct dentry * dentry,
+ struct page ** page)
{
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
@@ -482,13 +483,10 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
unsigned chunk_size = ext2_chunk_size(dir);
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned short rec_len, name_len;
- struct page *page = NULL;
- ext2_dirent * de;
+ ext2_dirent * de = NULL;
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
- loff_t pos;
- int err;

/*
* We take care of directory expansion in the same loop.
@@ -498,20 +496,19 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
for (n = 0; n <= npages; n++) {
char *dir_end;

- page = ext2_get_page(dir, n, 0);
- err = PTR_ERR(page);
- if (IS_ERR(page))
+ *page = ext2_get_page(dir, n, 0);
+ de = ERR_PTR(PTR_ERR(*page));
+ if (IS_ERR(*page))
goto out;
- lock_page(page);
- kaddr = page_address(page);
+ lock_page(*page);
+ kaddr = page_address(*page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
kaddr += PAGE_CACHE_SIZE - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
/* We hit i_size */
- name_len = 0;
- rec_len = chunk_size;
+ de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
@@ -519,12 +516,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
if (de->rec_len == 0) {
ext2_error(dir->i_sb, __func__,
"zero-length directory entry");
- err = -EIO;
+ de = ERR_PTR(-EIO);
goto out_unlock;
}
- err = -EEXIST;
if (ext2_match (namelen, name, de))
- goto out_unlock;
+ goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
@@ -533,13 +529,48 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
goto got_it;
de = (ext2_dirent *) ((char *) de + rec_len);
}
- unlock_page(page);
- ext2_put_page(page);
+ unlock_page(*page);
+ ext2_put_page(*page);
}
+
BUG();
- return -EINVAL;

got_it:
+ return de;
+ /* OFFSET_CACHE */
+out_unlock:
+ unlock_page(*page);
+ ext2_put_page(*page);
+out:
+ return de;
+}
+
+/*
+ * Parent is locked.
+ */
+int ext2_add_link (struct dentry *dentry, struct inode *inode)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+got_it:
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
pos = page_offset(page) +
(char*)de - (char*)page_address(page);
err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
@@ -563,7 +594,6 @@ got_it:
/* OFFSET_CACHE */
out_put:
ext2_put_page(page);
-out:
return err;
out_unlock:
unlock_page(page);
--
1.6.3.3

2009-10-21 19:20:55

by Valerie Aurora

Subject: [PATCH 15/41] whiteout: ext2 whiteout support

From: Jan Blunck <[email protected]>

This patch adds whiteout support to EXT2. A whiteout is an empty directory
entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
occupies space in the directory. Because whiteouts are implemented as a file
type, the EXT2_FEATURE_INCOMPAT_FILETYPE flag must be set.
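
The on-disk idea can be illustrated with a simplified userspace model of a directory entry. This is a sketch, not the ext2 layout: `struct model_dirent` omits `rec_len`, and the `MODEL_FT_*` numbers are placeholder assumptions, not values taken from this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FT_REG_FILE 1
#define MODEL_FT_WHT      8 /* assumed placeholder, not taken from the patch */

/* Simplified directory entry: no rec_len, fixed-size name buffer. */
struct model_dirent {
    uint32_t inode;
    uint8_t name_len;
    uint8_t file_type;
    char name[255];
};

/* A whiteout has no inode but still carries a name and the WHT type. */
int model_is_whiteout(const struct model_dirent *de)
{
    return de->inode == 0 && de->file_type == MODEL_FT_WHT;
}

/*
 * Same rule as the patched ext2_match(): an entry with inode == 0 is
 * normally unused and never matches, unless it is a whiteout.
 */
int model_match(size_t len, const char *name, const struct model_dirent *de)
{
    if (len != de->name_len)
        return 0;
    if (!de->inode && de->file_type != MODEL_FT_WHT)
        return 0;
    return !memcmp(name, de->name, len);
}

/* Turn `de` into a whiteout for `name`. */
void model_make_whiteout(struct model_dirent *de, const char *name)
{
    size_t len = strlen(name);

    assert(len <= sizeof(de->name));
    de->inode = 0;
    de->file_type = MODEL_FT_WHT;
    de->name_len = (uint8_t)len;
    memcpy(de->name, name, len);
}
```

Because the whiteout must stay findable by name, lookups can no longer treat every inode == 0 entry as free space, which is why both the match and the free-slot checks gain a file type test.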

XXX - Whiteouts could be implemented as special symbolic links

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 96 +++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 3 +
fs/ext2/inode.c | 11 ++++-
fs/ext2/namei.c | 65 ++++++++++++++++++++++++++++++-
fs/ext2/super.c | 7 +++
include/linux/ext2_fs.h | 4 ++
6 files changed, 176 insertions(+), 10 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index cb8ceff..d4628c0 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT))
return 0;
return !memcmp(name, de->name, len);
}
@@ -255,6 +255,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_FIFO] = DT_FIFO,
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
+ [EXT2_FT_WHT] = DT_WHT,
};

#define S_SHIFT 12
@@ -448,6 +449,26 @@ ino_t ext2_inode_by_name(struct inode *dir, struct qstr *child)
return res;
}

+/* Special version for filetype based whiteout support */
+ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct ext2_dir_entry_2 *de;
+ struct page *page;
+
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (de) {
+ res = le32_to_cpu(de->inode);
+ if (!res && de->file_type == EXT2_FT_WHT) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&dentry->d_lock);
+ }
+ ext2_put_page(page);
+ }
+ return res;
+}
+
/* Releases the page */
void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode, int update_times)
@@ -523,7 +544,8 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
- if (!de->inode && rec_len >= reclen)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
goto got_it;
@@ -564,8 +586,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
return PTR_ERR(de);

err = -EEXIST;
- if (ext2_match (namelen, name, de))
+ if (ext2_match (namelen, name, de)) {
+ if (de->file_type == EXT2_FT_WHT)
+ goto got_it;
goto out_unlock;
+ }

got_it:
name_len = EXT2_DIR_REC_LEN(de->name_len);
@@ -577,7 +602,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode) {
+ if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
@@ -646,6 +672,68 @@ out:
return err;
}

+int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
+ struct ext2_dir_entry_2 * de, struct page * page)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ loff_t pos;
+ int err;
+
+ if (!de) {
+ de = ext2_append_entry(dentry, &page);
+ BUG_ON(!de);
+ }
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de) &&
+ (de->file_type == EXT2_FT_WHT)) {
+ ext2_error(dir->i_sb, __func__,
+ "entry is already a whiteout in directory #%lu",
+ dir->i_ino);
+ goto out_unlock;
+ }
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ /*
+ * We whiteout an existing entry. Do what ext2_delete_entry() would do,
+ * except that we don't need to merge with the previous entry since
+ * we are going to reuse it.
+ */
+ if (ext2_match (namelen, name, de))
+ de->inode = 0;
+ if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_WHT;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
/*
* Set the first fragment of directory.
*/
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 9a8a8e2..a7f057f 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -102,9 +102,12 @@ extern void ext2_rsv_window_add(struct super_block *sb, struct ext2_reserve_wind
/* dir.c */
extern int ext2_add_link (struct dentry *, struct inode *);
extern ino_t ext2_inode_by_name(struct inode *, struct qstr *);
+extern ino_t ext2_inode_by_dentry(struct inode *, struct dentry *);
extern int ext2_make_empty(struct inode *, struct inode *);
extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *, struct page **);
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_whiteout_entry (struct inode *, struct dentry *,
+ struct ext2_dir_entry_2 *, struct page *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e271303..5f76e44 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1176,7 +1176,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|
+ S_OPAQUE);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1187,6 +1188,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT2_OPAQUE_FL)
+ inode->i_flags |= S_OPAQUE;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
@@ -1194,8 +1197,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
{
unsigned int flags = ei->vfs_inode.i_flags;

- ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|
- EXT2_IMMUTABLE_FL|EXT2_NOATIME_FL|EXT2_DIRSYNC_FL);
+ ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|EXT2_IMMUTABLE_FL|
+ EXT2_NOATIME_FL|EXT2_DIRSYNC_FL|EXT2_OPAQUE_FL);
if (flags & S_SYNC)
ei->i_flags |= EXT2_SYNC_FL;
if (flags & S_APPEND)
@@ -1206,6 +1209,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
ei->i_flags |= EXT2_NOATIME_FL;
if (flags & S_DIRSYNC)
ei->i_flags |= EXT2_DIRSYNC_FL;
+ if (flags & S_OPAQUE)
+ ei->i_flags |= EXT2_OPAQUE_FL;
}

struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 78d9b92..9c4eef2 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -54,15 +54,16 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
* Methods themselves.
*/

-static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
+static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry,
+ struct nameidata *nd)
{
struct inode * inode;
ino_t ino;
-
+
if (dentry->d_name.len > EXT2_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);

- ino = ext2_inode_by_name(dir, &dentry->d_name);
+ ino = ext2_inode_by_dentry(dir, dentry);
inode = NULL;
if (ino) {
inode = ext2_iget(dir->i_sb, ino);
@@ -230,6 +231,10 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
else
inode->i_mapping->a_ops = &ext2_aops;

+ /* if we call mkdir on a whiteout create an opaque directory */
+ if (dentry->d_flags & DCACHE_WHITEOUT)
+ inode->i_flags |= S_OPAQUE;
+
inode_inc_link_count(inode);

err = ext2_make_empty(inode, dir);
@@ -293,6 +298,59 @@ static int ext2_rmdir (struct inode * dir, struct dentry *dentry)
return err;
}

+/*
+ * Create a whiteout for the dentry
+ */
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
+ struct dentry *new_dentry)
+{
+ struct inode * inode = dentry->d_inode;
+ struct ext2_dir_entry_2 * de = NULL;
+ struct page * page = NULL;
+ int err = -ENOTEMPTY;
+
+ if (!EXT2_HAS_INCOMPAT_FEATURE(dir->i_sb,
+ EXT2_FEATURE_INCOMPAT_FILETYPE)) {
+ ext2_error (dir->i_sb, "ext2_whiteout",
+ "can't set whiteout filetype");
+ err = -EPERM;
+ goto out;
+ }
+
+ if (inode) {
+ if (S_ISDIR(inode->i_mode) && !ext2_empty_dir(inode))
+ goto out;
+
+ err = -ENOENT;
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (!de)
+ goto out;
+ lock_page(page);
+ }
+
+ err = ext2_whiteout_entry (dir, dentry, de, page);
+ if (err)
+ goto out;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (inode) {
+ inode->i_ctime = dir->i_ctime;
+ inode_dec_link_count(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ inode->i_size = 0;
+ inode_dec_link_count(inode);
+ inode_dec_link_count(dir);
+ }
+ }
+ err = 0;
+out:
+ return err;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -392,6 +450,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.mkdir = ext2_mkdir,
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
+ .whiteout = ext2_whiteout,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 1a9ffee..c414c6d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1062,6 +1062,13 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
ext2_warning(sb, __func__,
"mounting ext3 filesystem as ext2");
+
+ /*
+ * Whiteouts (and fallthrus) require explicit whiteout support.
+ */
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+ sb->s_flags |= MS_WHITEOUT;
+
ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
return 0;

diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 121720d..bd10826 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -189,6 +189,7 @@ struct ext2_group_desc
#define EXT2_NOTAIL_FL FS_NOTAIL_FL /* file tail should not be merged */
#define EXT2_DIRSYNC_FL FS_DIRSYNC_FL /* dirsync behaviour (directories only) */
#define EXT2_TOPDIR_FL FS_TOPDIR_FL /* Top of directory hierarchies*/
+#define EXT2_OPAQUE_FL 0x00040000
#define EXT2_RESERVED_FL FS_RESERVED_FL /* reserved for ext2 lib */

#define EXT2_FL_USER_VISIBLE FS_FL_USER_VISIBLE /* User visible flags */
@@ -503,10 +504,12 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff

#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT2_FEATURE_INCOMPAT_WHITEOUT| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -573,6 +576,7 @@ enum {
EXT2_FT_FIFO,
EXT2_FT_SOCK,
EXT2_FT_SYMLINK,
+ EXT2_FT_WHT,
EXT2_FT_MAX
};

--
1.6.3.3
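
The record handling in the ext2 whiteout path above can be sketched in userspace as follows. This is only an illustration, not the patch's code: `demo_dirent`, `demo_rec_len()` and `demo_whiteout()` are hypothetical names, the entry is kept in host byte order, and `DEMO_FT_WHT` is assumed to be 8 because `EXT2_FT_WHT` is added right after `EXT2_FT_SYMLINK` in the enum. It shows the two cases: an entry with the same name is reused in place, while any other live entry is shrunk to its minimal size and the whiteout is written into the slack behind it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct ext2_dir_entry_2 (host byte order). */
struct demo_dirent {
	uint32_t inode;
	uint16_t rec_len;
	uint8_t  name_len;
	uint8_t  file_type;
	char     name[];
};

#define DEMO_FT_WHT 8	/* assumed value, see the note above */

/* Space an entry needs: 8-byte header plus name, rounded up to 4 bytes. */
static unsigned demo_rec_len(unsigned name_len)
{
	return (name_len + 8u + 3u) & ~3u;
}

/*
 * Write a whiteout for "name" at or behind "de".
 * Returns the whiteout entry, or NULL if the name is already whited
 * out or there is not enough slack behind a live entry.
 * A free or matching slot is assumed to be large enough for the name.
 */
static struct demo_dirent *demo_whiteout(struct demo_dirent *de,
					 const char *name, uint8_t namelen)
{
	int match = de->name_len == namelen &&
		    memcmp(de->name, name, namelen) == 0;

	if (match && de->file_type == DEMO_FT_WHT)
		return NULL;		/* the patch returns -EEXIST here */
	if (match)
		de->inode = 0;		/* reuse this slot in place */

	if (de->inode || de->file_type == DEMO_FT_WHT) {
		/* Keep the existing entry, split off the space behind it. */
		unsigned used = demo_rec_len(de->name_len);
		unsigned total = de->rec_len;
		struct demo_dirent *de1;

		if (total < used + demo_rec_len(namelen))
			return NULL;
		de1 = (struct demo_dirent *)((char *)de + used);
		de1->rec_len = (uint16_t)(total - used);
		de->rec_len = (uint16_t)used;
		de = de1;
	}

	de->name_len = namelen;
	memcpy(de->name, name, namelen);
	de->inode = 0;
	de->file_type = DEMO_FT_WHT;
	return de;
}
```

Unlike ext2_delete_entry(), nothing is merged into the previous record here, since the slot is immediately reused for the whiteout.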

2009-10-21 19:20:44

by Valerie Aurora

Subject: [PATCH 16/41] whiteout: jffs2 whiteout support

From: Felix Fietkau <[email protected]>

Add support for whiteout dentries to jffs2.

Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
---
fs/jffs2/dir.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/jffs2/fs.c | 4 ++
fs/jffs2/super.c | 2 +-
include/linux/jffs2.h | 2 +
4 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 6f60cc9..46a2e1b 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

+static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+
const struct file_operations jffs2_dir_operations =
{
.read = generic_read_dir,
@@ -55,6 +57,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rmdir = jffs2_rmdir,
.mknod = jffs2_mknod,
.rename = jffs2_rename,
+ .whiteout = jffs2_whiteout,
.permission = jffs2_permission,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
@@ -98,8 +101,18 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
fd = fd_list;
}
}
- if (fd)
- ino = fd->ino;
+ if (fd) {
+ spin_lock(&target->d_lock);
+ switch(fd->type) {
+ case DT_WHT:
+ target->d_flags |= DCACHE_WHITEOUT;
+ break;
+ default:
+ ino = fd->ino;
+ break;
+ }
+ spin_unlock(&target->d_lock);
+ }
mutex_unlock(&dir_f->sem);
if (ino) {
inode = jffs2_iget(dir_i->i_sb, ino);
@@ -498,6 +511,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, int mode)
return PTR_ERR(inode);
}

+ if (dentry->d_flags & DCACHE_WHITEOUT) {
+ inode->i_flags |= S_OPAQUE;
+ ri->flags = cpu_to_je16(JFFS2_INO_FLAG_OPAQUE);
+ }
+
inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;

@@ -779,6 +797,61 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return 0;
}

+static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ struct jffs2_inode_info *victim_f = NULL;
+ uint32_t now;
+ int ret;
+
+ /* If it's a directory, then check whether it is really empty
+ */
+ if (new_dentry->d_inode) {
+ victim_f = JFFS2_INODE_INFO(old_dentry->d_inode);
+ if (S_ISDIR(old_dentry->d_inode->i_mode)) {
+ struct jffs2_full_dirent *fd;
+
+ mutex_lock(&victim_f->sem);
+ for (fd = victim_f->dents; fd; fd = fd->next) {
+ if (fd->ino) {
+ mutex_unlock(&victim_f->sem);
+ return -ENOTEMPTY;
+ }
+ }
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_WHT,
+ new_dentry->d_name.name, new_dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (victim_f) {
+ /* There was a victim. Kill it off nicely */
+ drop_nlink(old_dentry->d_inode);
+ /* Don't oops if the victim was a dirent pointing to an
+ inode which didn't exist. */
+ if (victim_f->inocache) {
+ mutex_lock(&victim_f->sem);
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ victim_f->inocache->pino_nlink = 0;
+ else
+ victim_f->inocache->pino_nlink--;
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ return 0;
+}
+
static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
struct inode *new_dir_i, struct dentry *new_dentry)
{
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 3451a81..c1e333c 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -301,6 +301,10 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned long ino)

inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;
+
+ if (je16_to_cpu(latest_node.flags) & JFFS2_INO_FLAG_OPAQUE)
+ inode->i_flags |= S_OPAQUE;
+
break;
}
case S_IFREG:
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 0035c02..6607f0b 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -172,7 +172,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)

sb->s_op = &jffs2_super_operations;
sb->s_export_op = &jffs2_export_ops;
- sb->s_flags = sb->s_flags | MS_NOATIME;
+ sb->s_flags = sb->s_flags | MS_NOATIME | MS_WHITEOUT;
sb->s_xattr = jffs2_xattr_handlers;
#ifdef CONFIG_JFFS2_FS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 2b32d63..65533bb 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -87,6 +87,8 @@
#define JFFS2_INO_FLAG_USERCOMPR 2 /* User has requested a specific
compression type */

+#define JFFS2_INO_FLAG_OPAQUE 4 /* Directory is opaque (for union mounts) */
+

/* These can go once we've made sure we've caught all uses without
byteswapping */
--
1.6.3.3
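
The lookup change in the jffs2 patch above boils down to a three-way classification of a name, which can be sketched in plain C as below. The types and names (`demo_full_dirent`, `demo_lookup()`, the `DEMO_*` constants) are hypothetical stand-ins, not jffs2 structures; `DEMO_DT_WHT` uses the same value as `DT_WHT` in `<dirent.h>`. A dirent of type whiteout yields a negative result that is flagged as a whiteout, any other dirent yields its inode number, and an inode number of 0 is treated as a plain negative entry.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_DT_REG 8	/* same value as DT_REG */
#define DEMO_DT_WHT 14	/* same value as DT_WHT */

/* Hypothetical, simplified stand-in for struct jffs2_full_dirent. */
struct demo_full_dirent {
	const char *name;
	unsigned int ino;
	unsigned char type;
	struct demo_full_dirent *next;
};

enum demo_lookup_result { DEMO_NEGATIVE, DEMO_WHITEOUT, DEMO_POSITIVE };

/*
 * Classify "name" in a directory's dirent list.
 * *ino is set to the inode number for DEMO_POSITIVE, otherwise to 0.
 */
static enum demo_lookup_result
demo_lookup(const struct demo_full_dirent *list, const char *name,
	    unsigned int *ino)
{
	const struct demo_full_dirent *fd;

	*ino = 0;
	for (fd = list; fd; fd = fd->next) {
		if (strcmp(fd->name, name) != 0)
			continue;
		if (fd->type == DEMO_DT_WHT)
			return DEMO_WHITEOUT;	/* would set DCACHE_WHITEOUT */
		if (!fd->ino)
			return DEMO_NEGATIVE;	/* deleted entry */
		*ino = fd->ino;
		return DEMO_POSITIVE;
	}
	return DEMO_NEGATIVE;
}
```

In a union lookup, only the plain negative case would let the search continue on the bottom layer; a whiteout stops it.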

2009-10-21 19:20:59

by Valerie Aurora

Subject: [PATCH 17/41] whiteout: Add path_whiteout() helper

From: Jan Blunck <[email protected]>

Add a path_whiteout() helper for vfs_whiteout().

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 15 ++++++++++++++-
include/linux/fs.h | 1 -
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9a62c75..408380d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2231,7 +2231,7 @@ static inline int may_whiteout(struct inode *dir, struct dentry *victim,
* After this returns with success, don't make any assumptions about the inode.
* Just dput() it dentry.
*/
-int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
+static int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
{
int err;
struct inode *old_inode = dentry->d_inode;
@@ -2283,6 +2283,19 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
return err;
}

+int path_whiteout(struct path *dir_path, struct dentry *dentry, int isdir)
+{
+ int error = mnt_want_write(dir_path->mnt);
+
+ if (!error) {
+ error = vfs_whiteout(dir_path->dentry->d_inode, dentry, isdir);
+ mnt_drop_write(dir_path->mnt);
+ }
+
+ return error;
+}
+EXPORT_SYMBOL(path_whiteout);
+
/*
* This is abusing readdir to check if a union directory is logically empty.
* Al Viro barfed when he saw this, but Val said: "Well, at this point I'm
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b741e50..d13de8a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1426,7 +1426,6 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
-extern int vfs_whiteout(struct inode *, struct dentry *, int);

/*
* VFS dentry helper functions.
--
1.6.3.3
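
The shape of path_whiteout() above (take write access on the mount, run the operation, drop write access) can be modelled in userspace roughly like this. Everything here is a hypothetical stand-in: `demo_mount`, `demo_want_write()`, `demo_drop_write()` and `demo_path_op()` only mimic the bracket formed by mnt_want_write() and mnt_drop_write(), and `demo_count_op()` takes the place of vfs_whiteout(). The sketch assumes that a read-only mount makes the "want write" step fail with -EROFS.

```c
#include <errno.h>

/* Hypothetical stand-in for the write-access state of a vfsmount. */
struct demo_mount {
	int readonly;
	int writers;
};

/* Take write access; fails on a read-only mount. */
static int demo_want_write(struct demo_mount *mnt)
{
	if (mnt->readonly)
		return -EROFS;
	mnt->writers++;
	return 0;
}

/* Release write access taken by demo_want_write(). */
static void demo_drop_write(struct demo_mount *mnt)
{
	mnt->writers--;
}

typedef int (*demo_op_t)(void *arg);

/* Run op() only while write access to the mount is held. */
static int demo_path_op(struct demo_mount *mnt, demo_op_t op, void *arg)
{
	int error = demo_want_write(mnt);

	if (!error) {
		error = op(arg);
		demo_drop_write(mnt);
	}
	return error;
}

/* Example operation: counts how often it was called. */
static int demo_count_op(void *arg)
{
	int *calls = arg;

	(*calls)++;
	return 0;
}
```

The write access is dropped whether or not the operation succeeds, and the operation never runs if write access could not be taken.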

2009-10-21 19:20:58

by Valerie Aurora

Subject: [PATCH 18/41] union-mount: Documentation

Document design and implementation of writable overlays (a.k.a. union
mounts).

Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/union-mounts.txt | 708 ++++++++++++++++++++++++++++
1 files changed, 708 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..5f47296
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,708 @@
+State of writable overlays (formerly union mounts)
+==================================================
+
+This version of union mounts is renamed "writable overlays." The goal
+of this patch set is to support a single read-write file system
+overlaid on a single read-only file system. "Union mounts" suggests
+that we support unions of arbitrary numbers and types of file systems,
+which is not the goal of this patch set.
+
+The most recent version of writable overlays can boot to multi-user
+mode with a writable overlay root file system. open(), truncate(),
+creat(), unlink(), mkdir(), rmdir(), and rename() work. link(),
+chmod(), chown(), and chattr() don't work yet.
+
+This document describes the architecture and current status of
+writable overlays, including an item-by-item todo list.
+
+Writable overlays (formerly union mounts)
+=========================================
+
+In this document:
+ - Overview of writable overlays
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to writable overlays
+
+Overview
+========
+
+Writable overlays (formerly known as union mounts) are used to layer a
+single writable file system over a single read-only file system, with
+all writes going to the writable file system. The namespace of both
+file systems appears as a combined whole to userland, with those on
+the writable file system covering up any matching pathnames on the
+read-only file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device; however, sharing at the file system level has higher
+performance and uses less disk space.
+
+What writable overlays are not
+------------------------------
+
+Writable overlays are not a general-purpose unioning file system.
+They do not provide a generic "union of namespaces" operation for an
+arbitrary number of file systems. Many interesting features can be
+implemented with a generic unioning facility: unioning of more than
+two file systems, dynamic insertion and removal of branches, online
+upgrade, etc. Some unioning file systems that do this are UnionFS and
+AUFS. Unfortunately, the complexity of these feature sets leads to
+difficult corner cases which so far have been unsolvable in the
+context of the Linux VFS.
+
+Writable overlays avoid these corner cases by reducing the feature set
+to the bare minimum of most-requested features: one writable file system
+layered over one read-only file system. Despite the limitations of
+writable overlays, the VFS infrastructure they use is generic enough
+to be reused by more full-featured unioning file systems.
+
+Terminology
+===========
+
+The main analogy for writable overlays is that a writable file system
+is mounted "on top" of a read-only file system. Lookups start at the
+"top" read-write file system and travel "down" to the "bottom"
+read-only file system only if no blocking entry exists on the top
+layer.
+
+Top layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer. Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque: A flag on a directory in the top layer that prevents lookups
+of entries in this directory from travelling down to the bottom
+layer (unless there is an explicit fallthru entry allowing that for a
+particular entry). Set on creation of a directory that replaces a
+whiteout, and after a directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry. This serves as a
+placeholder for directory entries from the bottom layer during
+readdir(). Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same properties
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory. Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on top layer
+- unlink() /oldfile -> creates a whiteout on top layer
+- Edit /existingfile -> copies up to top layer at open(O_WRONLY or O_RDWR) time
+- truncate /existingfile -> copies up to top layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to top layer
+- mkdir() /newdir -> creates on top layer
+- rmdir() /olddir -> creates a whiteout on top layer
+- mkdir() /olddir after above -> creates on top layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() dir -> EXDEV
+
+Getting to a root file system with a writable overlay:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the writable overlay on /newroot:
+ # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Writable overlays are implemented as an integral part of the VFS,
+rather than as a VFS client file system (i.e., a stacked file system
+like unionfs or ecryptfs). Implementing writable overlays inside the
+VFS eliminates the need for duplicate copies of VFS data structures,
+unnecessary indirection, and code duplication, but requires very
+maintainable, low-to-zero overhead code. Writable overlays require no
+change to file systems serving as the read-only layer, and require
+some minor support from file systems serving as the read-write layer.
+File systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+union_mount structure
+---------------------
+
+The primary data structure for writable overlays is the union_mount
+structure, which connects overlapping directory dentries into a "union
+stack":
+
+struct union_mount {
+ atomic_t u_count; /* reference count */
+ struct mutex u_mutex;
+ struct list_head u_unions; /* list head for d_unions */
+ struct list_head u_list; /* list head for mnt_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */
+
+ struct path u_this; /* this is me */
+ struct path u_next; /* this is what I overlay */
+};
+
+The union_mount is referenced from the corresponding directory's
+dentry:
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * The following fields are used by the VFS based union mount
+ * implementation. Both are protected by union_lock!
+ */
+ struct list_head d_unions; /* list of union_mounts */
+ unsigned int d_unionized; /* unions referencing this dentry */
+#endif
+[...]
+};
+
+Each top layer directory with the potential for a lookup to fall
+through to the bottom layer has a union_mount structure stored in a
+union_mount hash table. The union_mount's can be looked up both by the
+top layer's path (via union_lookup()) and the bottom layer's path (via
+union_rlookup()). Once you have the path (vfsmount and dentry pair)
+of a file, the union stack can be followed down, layer by layer, with
+follow_union_down(), and up with follow_union_mount().
+
+All union_mount structures are allocated from a kmem cache. A
+union_mount is allocated when the first referencing dentry is
+allocated and freed when all of the referencing dentries are freed -
+that is, the dcache drives the union cache. While writable overlays
+only use two layers, the union stack infrastructure is capable of
+supporting an arbitrary number of file system layers (leaving aside
+locking issues).
+
+Todo:
+
+- Rename union_mount structure - it's per directory, not per mount
+
+Code paths
+----------
+
+Writable overlays modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Path lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Writable overlays are created in two steps:
+
+1. Mount the bottom layer file system read-only in the usual manner.
+2. Mount the top layer with the "-o union" option at the same mountpoint.
+
+The bottom layer must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus (indicated by setting
+the MS_WHITEOUT flag). Currently, the top layer is forced to
+"noatime" to avoid a copyup on every access of a file. Supporting
+atime with the current infrastructure would require a copyup on every
+open().
+
+Currently, the top layer covers all submounts on the read-only file
+system. This can be inconvenient; e.g., mounting a writable overlay
+on the root file system after procfs has been mounted. It's not clear
+what the right behavior is. Also, it may be smarter to mount both
+read-only and read-write layers in one step, but the mount options get
+pretty ugly.
+
+pivot_root() is supported and is the recommended way to get to a root
+file system with a writable overlay.
+
+Todo:
+
+- Rename "-o union" mount option - "overlay"?
+- Don't permit mounting over read-write submounts
+- Choose submount covering behavior
+- Allow atime?
+
+Really really read-only file systems: In Linux, any individual file
+system may be mounted at multiple places in the namespace. The file
+system may change from read-only to read-write while still mounted.
+Thus, simply checking that the bottom layer is read-only at the time
+the writable overlay is mounted over it is pointless, since at any
+time the bottom layer may become read-write.
+
+We need to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a writable overlay. To do this, we track
+the number of "read-only users" of a file system in its VFS superblock
+structure. When we mount a writable overlay over a file system, we
+increment its read-only user count. The file system can only be
+mounted read-write if its read-only users count is zero.
+
+Todo:
+
+- Support really really read-only NFS mounts. See discussion here:
+
+ http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Path lookup
+-----------
+
+Much of the action in writable overlays happens during lookup().
+First, if we lookup a directory on the bottom layer that doesn't yet
+exist on the top layer, __link_path_walk() always creates a matching
+directory on the top layer. This way, we never have to walk back up a
+path, creating directories as we go, before we can copyup a file.
+Second, if we need to copy up a file, we first (re)look it up with the
+LOOKUP_TOPMOST flag, which instructs __link_path_walk() to create it
+on the top layer. Neither directory entries nor file data are copied
+up in __link_path_walk() - that happens after the lookup, in the
+caller.
+
+The main cut-out to writable overlay code is in do_lookup():
+
+static int do_lookup(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ int err;
+
+ if (IS_MNT_UNION(nd->path.mnt))
+ goto need_union_lookup;
+[...]
+need_union_lookup:
+ err = cache_lookup_union(nd, name, path);
+ if (!err && path->dentry)
+ goto done;
+
+ err = real_lookup_union(nd, name, path);
+ if (err)
+ goto fail;
+ goto done;
+
+cache_lookup_union() looks for the dentry in the dcache, starting at
+the top layer and following down. If it finds nothing, it returns a
+negative dentry from the top layer. If it finds a directory, it looks
+for the same directory in the bottom layer; if that exists, it
+allocates a union_mount struct and hangs the bottom layer dentry off
+of it. real_lookup_union() does the same for uncached entries.
+
+Todo:
+
+- Reorganize cache/hash/real lookup code - lots of code duplication
+- Turn create-on-topmost test into #ifdef'able function
+- Rewrite with assumption that topmost directory always exists
+- Remove duplicated tests and other duplicated code
+
+File copyup
+-----------
+
+Any system call that alters an existing file on the bottom layer
+(including creating or moving a hard link to it) will trigger a copyup
+of the target file to the top layer (via union_copyup() or
+__union_copyup()). This includes:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
+ - truncate()/ftruncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chattr()
+
+Copyup of a file DOES NOT occur on:
+
+ - open(O_RDONLY) if noatime
+ - stat() if noatime
+ - creat()/mkdir()/mknod()
+ - symlink()
+ - unlink()/rmdir()
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version. Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible and unreferenced object
+that used to have the same pathname. Only opens that occur after the
+copyup will see updates to the file.
+
+Todo:
+
+- copyup on chown()/chmod()/chattr()
+- copyup if atime is enabled?
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO. At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()). In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode. This requires a full path, but
+often we only have the inode at this point. In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+Another trouble-maker is may_open(), which both checks permissions for
+open AND truncates the file if O_TRUNC is specified. It doesn't make
+any sense to copy up the file and then let may_open() truncate it, but
+we can't copy it after may_open() truncates it either. The current
+ugly hack is to pass the full nameidata to may_open() and copyup
+inside may_open().
+
+Some solutions:
+
+- Create __inode_permission() and pass it a flag telling it whether or
+ not to check for a read-only fs. Create union_permission() which
+ takes a path, checks for a union mount, and sets the rofs flag.
+ Place the file copyup call after all the permission checks are
+ completed. Push down the full path into the functions that need it
+ and currently only take the dentry or inode.
+
+- For each instance in which we might want to copyup, move permission
+ checks into a new function and call it from a level at which we
+ still have the full path. Pass it an "ignore read-only fs" flag if
+ the file is on a union mount. Pass around the ignore-rofs flag
+ inside the function doing permission checks. If all the permission
+ checks complete successfully, copyup the file. Would require moving
+ truncate out of may_open().
+
+Todo:
+ - On truncate, only copy up the N bytes of file data requested
+ - Make sure above handles truncate beyond EOF correctly
+ - File copyup on chown()/chmod()/chattr() etc.
+ - File copyup on open(O_APPEND)
+ - File copyup on open(O_DIRECT)
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see include/linux/union.h). The union-specific code
+in the cache lookup path is out of line.
+
+Currently, is_unionized() is pretty heavy-weight: it walks up the
+mount hierarchy, grabbing the vfsmount lock at each level. It may be
+possible to simplify this greatly if a writable layer can only cover
+exactly one mount, rather than a tree of mounts.
+
+Todo:
+
+ - Turn copyup in __link_path_walk() into #ifdef'd function
+ - Do performance tests
+ - Optimize is_unionized()
+ - Properly #ifdef out mount path code
+
+Locking strategy
+================
+
+The current writable overlay locking strategy is based on the
+following rules:
+
+* Exactly two file systems are unioned
+* The bottom file system is always read-only
+* The top file system is always read-write
+ => A file system can never be a top and a bottom layer at the same time
+
+Additionally, the top layer (the writable overlay) may only be mounted
+exactly once. Don't think of the writable overlay as a separate
+independent file system; when it is mounted as a writable overlay, it
+is only a file system in conjunction with the read-only bottom layer.
+The read-only bottom layer is an independent file system in and of
+itself and can be mounted elsewhere, including as the bottom layer for
+another writable overlay.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer. Objects from the bottom layer are
+never changed (so don't need write locks) and only require atomic
+operations to manage kernel data structures (ref counts, etc.).
+
+Another simplifying assumption is that all directories in a pathname
+exist on the top layer, as they are created step-by-step during
+lookup. This prevents us from ever having to walk backwards up the
+path creating directory entries, which can get complicated especially
+when you consider the need to prevent topology changes. By
+implication, parent directories during any operation (rename(),
+unlink(), etc.) are from the top layer. Dentries for directories from
+the bottom layer are only ever used by lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A. Sometimes locks on
+objects in both A and B will have to be held simultaneously. What
+order should they be acquired in? Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires lock
+on object from A and then tries for a lock on object from B, while
+another thread grabs the lock on object from B and then waits for the
+lock on object from A. Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time. Changes in the directory topology and their effect on
+inheritance are of special concern. Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Holds i_mutex on top layer directory inode while doing lookups on
+bottom layer. Grabs i_mutex on bottom layer off and on.
+
+Todo:
+ - Is i_mutex on lower directory necessary?
+
+File copyup in general:
+
+File copyup occurs while holding i_mutex on the parent directory of
+the top layer. As noted before, an in-kernel file copyup is the
+logical equivalent of a userspace rename() of an identical file on to
+this pathname.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer. Followed by a normal link() operation.
+
+rename():
+
+First, renaming of directories returns EXDEV. It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway.
+
+Rename involves two operations on a writable overlay: (1) creation of
+a whiteout covering the source of the rename, (2) a copyup of the file
+from the bottom layer. The file copyup does not need to happen
+atomically, only the whiteout and the new link to the file.
+
+I propose that we copyup the source file to the "old" name (rather
+than directly to the "new" name), and then perform the normal file
+system rename operation. The only addition is creation of whiteout
+for the old name.
+
+The current rename() implementation is just a hack to get things
+working and doesn't work at all as described above.
+
+Lock order: The file copyup happens before the rename() lock. When we
+create the whiteout, we will already have the directory i_mutex.
+Otherwise, locking as usual.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir(). We hold the
+top layer directory i_mutex throughout. A fallthru is created for
+each entry that appears only on the lower layer.
+
+Current patch takes the i_mutex on the bottom layer directory, which
+doesn't seem to be necessary.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag.
+
+Whiteouts and fallthrus are most similar to symlinks, since they
+redirect to an object possibly located in another file system without
+keeping a reference on it.
+
+Todo:
+
+- Return correct inode number in d_ino member of struct dirent by one of:
+ - Save inode number of target in fallthru entry itself
+ - Lookup inode number during readdir()
+- Try re-implementing ext2 as special symlinks - may be much simpler
+- Implement ext3 (also as symlinks?)
+- Implement btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer. This patch set implements whiteouts for ext2, tmpfs, and
+jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer. To do this, the
+server needs to revoke access to clients requesting read-only file
+systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur). Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods. DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Implement whiteout()/fallthru() for NFS
+- Guarantee really really read-only on NFS exports
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel. A util-linux git
+tree with writable overlay support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An
+e2fsprogs git tree with writable overlay support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them. Userland will
+never see fallthru directory entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+- Link count may be wrong for files on bottom layer with > 1 link count
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+ rename(). Any existing open file descriptors will continue to refer
+ to the read-only copy on the bottom layer and will not see any
+ changes that occur after the copy-up.
+- rename() of directory fails with EXDEV
+
+Status
+======
+
+The current writable overlays patch set varies between RFC/prototype
+and pretty stable, depending on the particular patch. The current
+patch set boots to multi-user mode with a writable overlay root file
+system (albeit with some complaints). Some parts of the code were
+written years ago and have been reviewed, rewritten and tested many
+times. Other parts were written last month and need review,
+rewriting, and testing. The commit messages note the state of each
+patch.
+
+The current patch set is against 2.6.31. You can find it here, in the
+branch "overlay":
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Non-features
+------------
+
+Features we do not currently plan to support as part of writable
+overlays:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer to change while the writable
+overlay file system is mounted invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories. Doing an in-kernel copy of a single file is
+bad enough. Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs). Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+writable overlays with its writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Aggregation of multiple read-only file systems: While perfectly
+reasonable from a user perspective, we just aren't smart enough to
+figure out the locking problems from a kernel perspective. Sorry!
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to writable overlays
+=================================
+
+The writable overlays web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The mailing list for discussing writable overlays is:
+
+linux-fsdevel@vger.kernel.org
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
--
1.6.3.3
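
The directory copyup rule from the documentation above (on the first
readdir(), create a fallthru for each entry that appears only on the
lower layer) can be sketched in userspace C. This is a simplified
model, not the patch's implementation: struct dir, dir_find(),
dir_copyup(), and dir_visible() are hypothetical names invented for
illustration, directories are flat arrays, and no locking is modeled
(the patch holds the top layer directory i_mutex throughout).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Kinds of directory entry on the writable top layer (model only). */
enum entry_type { ENT_FILE, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry {
	const char *name;
	enum entry_type type;
};

#define MAX_ENTRIES 16

/* Hypothetical flat directory; real directories live in the fs. */
struct dir {
	struct entry ents[MAX_ENTRIES];
	int count;
};

/* Return the entry called name in d, or NULL if there is none. */
static struct entry *dir_find(struct dir *d, const char *name)
{
	for (int i = 0; i < d->count; i++)
		if (strcmp(d->ents[i].name, name) == 0)
			return &d->ents[i];
	return NULL;
}

/*
 * Model of directory copyup: every bottom layer name with no entry of
 * any kind on the top layer gets a fallthru. Existing files and
 * whiteouts on the top layer are left alone.
 * Returns the number of fallthrus created, or -1 if top is full.
 */
static int dir_copyup(struct dir *top, const struct dir *bottom)
{
	int created = 0;

	assert(top->count <= MAX_ENTRIES);
	for (int i = 0; i < bottom->count; i++) {
		const char *name = bottom->ents[i].name;

		if (dir_find(top, name))
			continue;
		if (top->count == MAX_ENTRIES)
			return -1;
		top->ents[top->count].name = name;
		top->ents[top->count].type = ENT_FALLTHRU;
		top->count++;
		created++;
	}
	return created;
}

/* Count the names userland would see: whiteouts are never returned. */
static int dir_visible(const struct dir *top)
{
	int n = 0;

	for (int i = 0; i < top->count; i++)
		if (top->ents[i].type != ENT_WHITEOUT)
			n++;
	return n;
}
```

With a top layer holding a file "a" and a whiteout "b", and a bottom
layer holding "a", "b", "c", and "d", dir_copyup() creates fallthrus
for "c" and "d" only. A second call creates nothing, which mirrors the
idea that copyup is needed only on the first readdir().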

2009-10-21 19:26:39

by Valerie Aurora

Subject: [PATCH 19/41] union-mount: Introduce MNT_UNION and MS_UNION flags

From: Jan Blunck <[email protected]>

Add per mountpoint flag for Union Mount support. You need additional patches
to util-linux for that to work - see:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 5 ++++-
include/linux/fs.h | 1 +
include/linux/mount.h | 1 +
3 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 4cd43ea..81b3188 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -770,6 +770,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_STRICTATIME, ",strictatime" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
@@ -1925,10 +1926,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_UNION);

/* ... and get the mountpoint */
retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d13de8a..efea78c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -188,6 +188,7 @@ struct inodes_stat_t {
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 5d52753..e175c47 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -35,6 +35,7 @@ struct mnt_namespace;
#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
#define MNT_PNODE_MASK 0x3000 /* propagation flag mask */
+#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */

struct vfsmount {
struct list_head mnt_hash;
--
1.6.3.3
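
The do_mount() hunk above translates the userspace MS_UNION flag into
the per-mount MNT_UNION flag and then strips MS_UNION from the flags
that are passed on. A minimal userspace sketch of just that step
follows. The two constants are copied from this patch; struct
split_flags and split_mount_flags() are hypothetical names used only
for illustration, and every other mount flag is passed through
untouched, which is a simplification of what do_mount() does.

```c
#include <assert.h>

#define MS_UNION	256	/* value from include/linux/fs.h in this patch */
#define MNT_UNION	0x4000	/* value from include/linux/mount.h in this patch */

/* Hypothetical container for the two flag words after the split. */
struct split_flags {
	unsigned long sb_flags;	/* what remains of the mount(2) flags */
	int mnt_flags;		/* per-mountpoint flags */
};

/*
 * Model of the MS_UNION handling only: set MNT_UNION if the caller
 * asked for a union mount, then clear MS_UNION from the remaining
 * flags so it is not passed any further down.
 */
static struct split_flags split_mount_flags(unsigned long flags)
{
	struct split_flags out = { flags, 0 };

	if (flags & MS_UNION)
		out.mnt_flags |= MNT_UNION;
	out.sb_flags &= ~(unsigned long)MS_UNION;

	assert((out.sb_flags & MS_UNION) == 0);
	return out;
}
```

The point of the split is that "union" is a property of the mountpoint
rather than of the superblock, which is why it ends up in mnt_flags
and is shown by show_mnt_opts().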

2009-10-21 19:21:06

by Valerie Aurora

Subject: [PATCH 20/41] union-mount: Introduce union_mount structure

From: Jan Blunck <[email protected]>

This patch adds the basic structures of VFS based union mounts. It is a new
implementation based on some of my old ideas that influenced Bharata B Rao
<[email protected]> who came up with the proposal to let the
union_mount struct only point to the next layer in the union stack. I rewrote
nearly all of the central patches around lookup and the dcache interaction.

Advantages of the new implementation:
- the new union stack is no longer tied directly to one dentry
- the union stack enables dentries to be part of more than one union
(bind mounts)
- it is unnecessary to traverse the union stack when de/referencing a dentry
- caching of union stack information still driven by dentry cache

XXX - is_unionized() is pretty heavy-weight for non-union file systems
on a union mount-enabled kernel. May be simplified by assuming one or
more of:

- Two layers only
- One-to-one association between layers (doesn't union submounts)
- Writable layer mounted in only one place

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/Kconfig | 13 ++
fs/Makefile | 1 +
fs/dcache.c | 4 +
fs/union.c | 332 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 9 ++
include/linux/union.h | 61 +++++++++
6 files changed, 420 insertions(+), 0 deletions(-)
create mode 100644 fs/union.c
create mode 100644 include/linux/union.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 0e7da7b..3e4f664 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,19 @@ source "fs/notify/Kconfig"

source "fs/quota/Kconfig"

+config UNION_MOUNT
+ bool "Writable overlays (union mounts) (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Writable overlays allow you to mount a transparent writable
+ layer over a read-only file system, for example, an ext3
+ partition on a hard drive over a CD-ROM root file system
+ image.
+
+ See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+ If unsure, say N.
+
source "fs/autofs/Kconfig"
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..4ed672e 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_NFS_COMMON) += nfs_common/
obj-$(CONFIG_GENERIC_ACL) += generic_acl.o

obj-y += quota/
+obj-$(CONFIG_UNION_MOUNT) += union.o

obj-$(CONFIG_PROC_FS) += proc/
obj-y += partitions/
diff --git a/fs/dcache.c b/fs/dcache.c
index 1fae1df..56bd05f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1046,6 +1046,10 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
+#ifdef CONFIG_UNION_MOUNT
+ INIT_LIST_HEAD(&dentry->d_unions);
+ dentry->d_unionized = 0;
+#endif

if (parent) {
dentry->d_parent = dget(parent);
diff --git a/fs/union.c b/fs/union.c
new file mode 100644
index 0000000..d1950c2
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,332 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/hash.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/fs_struct.h>
+#include <linux/union.h>
+
+/*
+ * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
+ * should try to make this good - I've just made it work.
+ */
+static unsigned int union_hash_mask __read_mostly;
+static unsigned int union_hash_shift __read_mostly;
+static struct hlist_head *union_hashtable __read_mostly;
+static unsigned int union_rhash_mask __read_mostly;
+static unsigned int union_rhash_shift __read_mostly;
+static struct hlist_head *union_rhashtable __read_mostly;
+
+/*
+ * Locking Rules:
+ * - dcache_lock (for union_rlookup() only)
+ * - union_lock
+ */
+DEFINE_SPINLOCK(union_lock);
+
+static struct kmem_cache *union_cache __read_mostly;
+
+static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
+{
+ unsigned long tmp;
+
+ tmp = ((unsigned long)mnt * (unsigned long)dentry) ^
+ (GOLDEN_RATIO_PRIME + (unsigned long)mnt) / L1_CACHE_BYTES;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> union_hash_shift);
+ return tmp & union_hash_mask;
+}
+
+static __initdata unsigned long union_hash_entries;
+
+static int __init set_union_hash_entries(char *str)
+{
+ if (!str)
+ return 0;
+ union_hash_entries = simple_strtoul(str, &str, 0);
+ return 1;
+}
+
+__setup("union_hash_entries=", set_union_hash_entries);
+
+static int __init init_union(void)
+{
+ int loop;
+
+ union_cache = KMEM_CACHE(union_mount, SLAB_PANIC | SLAB_MEM_SPREAD);
+ union_hashtable = alloc_large_system_hash("Union-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_hash_shift,
+ &union_hash_mask,
+ 0);
+
+ for (loop = 0; loop < (1 << union_hash_shift); loop++)
+ INIT_HLIST_HEAD(&union_hashtable[loop]);
+
+
+ union_rhashtable = alloc_large_system_hash("rUnion-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_rhash_shift,
+ &union_rhash_mask,
+ 0);
+
+ for (loop = 0; loop < (1 << union_rhash_shift); loop++)
+ INIT_HLIST_HEAD(&union_rhashtable[loop]);
+
+ return 0;
+}
+
+fs_initcall(init_union);
+
+struct union_mount *union_alloc(struct dentry *this, struct vfsmount *this_mnt,
+ struct dentry *next, struct vfsmount *next_mnt)
+{
+ struct union_mount *um;
+
+ BUG_ON(!S_ISDIR(this->d_inode->i_mode));
+ BUG_ON(!S_ISDIR(next->d_inode->i_mode));
+
+ um = kmem_cache_alloc(union_cache, GFP_ATOMIC);
+ if (!um)
+ return NULL;
+
+ atomic_set(&um->u_count, 1);
+ INIT_LIST_HEAD(&um->u_unions);
+ INIT_HLIST_NODE(&um->u_hash);
+ INIT_HLIST_NODE(&um->u_rhash);
+
+ um->u_this.mnt = this_mnt;
+ um->u_this.dentry = this;
+ um->u_next.mnt = mntget(next_mnt);
+ um->u_next.dentry = dget(next);
+
+ return um;
+}
+
+struct union_mount *union_get(struct union_mount *um)
+{
+ BUG_ON(!atomic_read(&um->u_count));
+ atomic_inc(&um->u_count);
+ return um;
+}
+
+static int __union_put(struct union_mount *um)
+{
+ if (!atomic_dec_and_test(&um->u_count))
+ return 0;
+
+ BUG_ON(!hlist_unhashed(&um->u_hash));
+ BUG_ON(!hlist_unhashed(&um->u_rhash));
+
+ kmem_cache_free(union_cache, um);
+ return 1;
+}
+
+void union_put(struct union_mount *um)
+{
+ struct path tmp = um->u_next;
+
+ if (__union_put(um))
+ path_put(&tmp);
+}
+
+static void __union_hash(struct union_mount *um)
+{
+ hlist_add_head(&um->u_hash, union_hashtable +
+ hash(um->u_this.dentry, um->u_this.mnt));
+ hlist_add_head(&um->u_rhash, union_rhashtable +
+ hash(um->u_next.dentry, um->u_next.mnt));
+}
+
+static void __union_unhash(struct union_mount *um)
+{
+ hlist_del_init(&um->u_hash);
+ hlist_del_init(&um->u_rhash);
+}
+
+struct union_mount *union_lookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct hlist_head *head = union_hashtable + hash(dentry, mnt);
+ struct hlist_node *node;
+ struct union_mount *um;
+
+ hlist_for_each_entry(um, node, head, u_hash) {
+ if ((um->u_this.dentry == dentry) &&
+ (um->u_this.mnt == mnt))
+ return um;
+ }
+
+ return NULL;
+}
+
+struct union_mount *union_rlookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct hlist_head *head = union_rhashtable + hash(dentry, mnt);
+ struct hlist_node *node;
+ struct union_mount *um;
+
+ hlist_for_each_entry(um, node, head, u_rhash) {
+ if ((um->u_next.dentry == dentry) &&
+ (um->u_next.mnt == mnt))
+ return um;
+ }
+
+ return NULL;
+}
+
+/*
+ * is_unionized - check if a dentry lives on a union mounted file system
+ *
+ * This tests if a dentry is living on an union mounted file system by walking
+ * the file system hierarchy.
+ */
+int is_unionized(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct path this = { .mnt = mntget(mnt),
+ .dentry = dget(dentry) };
+ struct vfsmount *tmp;
+
+ do {
+ /* check if there is an union mounted on top of us */
+ spin_lock(&vfsmount_lock);
+ list_for_each_entry(tmp, &this.mnt->mnt_mounts, mnt_child) {
+ if (!(tmp->mnt_flags & MNT_UNION))
+ continue;
+ /* Isn't this a bug? */
+ if (this.dentry->d_sb != tmp->mnt_mountpoint->d_sb)
+ continue;
+ if (is_subdir(this.dentry, tmp->mnt_mountpoint)) {
+ spin_unlock(&vfsmount_lock);
+ path_put(&this);
+ return 1;
+ }
+ }
+ spin_unlock(&vfsmount_lock);
+
+ /* check our mountpoint next */
+ tmp = mntget(this.mnt->mnt_parent);
+ dput(this.dentry);
+ this.dentry = dget(this.mnt->mnt_mountpoint);
+ mntput(this.mnt);
+ this.mnt = tmp;
+ } while (this.mnt != this.mnt->mnt_parent);
+
+ path_put(&this);
+ return 0;
+}
+
+int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
+ struct vfsmount *dest_mnt, struct dentry *dest_dentry)
+{
+ struct union_mount *this, *um;
+
+ BUG_ON(!IS_MNT_UNION(mnt));
+
+ this = union_alloc(dentry, mnt, dest_dentry, dest_mnt);
+ if (!this)
+ return -ENOMEM;
+
+ spin_lock(&union_lock);
+ um = union_lookup(dentry, mnt);
+ if (um) {
+ BUG_ON((um->u_next.dentry != dest_dentry) ||
+ (um->u_next.mnt != dest_mnt));
+ spin_unlock(&union_lock);
+ union_put(this);
+ return 0;
+ }
+ __union_hash(this);
+ spin_unlock(&union_lock);
+ return 0;
+}
+
+/*
+ * follow_union_down - follow the union stack one layer down
+ *
+ * This is called to traverse the union stack from one layer to the next
+ * overlayed one. follow_union_down() is called by various lookup functions
+ * that are aware of union mounts.
+ *
+ * Returns non-zero if followed to the next layer, zero otherwise.
+ */
+int follow_union_down(struct vfsmount **mnt, struct dentry **dentry)
+{
+ struct union_mount *um;
+
+ if (!IS_MNT_UNION(*mnt))
+ return 0;
+
+ spin_lock(&union_lock);
+ um = union_lookup(*dentry, *mnt);
+ spin_unlock(&union_lock);
+ if (um) {
+ path_get(&um->u_next);
+ dput(*dentry);
+ *dentry = um->u_next.dentry;
+ mntput(*mnt);
+ *mnt = um->u_next.mnt;
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * follow_union_mount - follow the union stack to the topmost layer
+ *
+ * This is called to traverse the union stack to the topmost layer. This is
+ * necessary for following parent pointers in an union mount.
+ *
+ * Returns non-zero if followed to the topmost layer, zero otherwise.
+ */
+int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
+{
+ struct union_mount *um;
+ int res = 0;
+
+ while (IS_UNION(*dentry)) {
+ spin_lock(&dcache_lock);
+ spin_lock(&union_lock);
+ um = union_rlookup(*dentry, *mnt);
+ if (um)
+ path_get(&um->u_this);
+ spin_unlock(&union_lock);
+ spin_unlock(&dcache_lock);
+
+ /*
+ * Q: Aaargh, how do I validate the topmost dentry pointer?
+ * A: Eeeeasy! We took the dcache_lock and union_lock. Since
+ * this protects from any dput'ng going on, we know that the
+ * dentry is valid since the union is unhashed under
+ * dcache_lock too.
+ */
+ if (!um)
+ break;
+ dput(*dentry);
+ *dentry = um->u_this.dentry;
+ mntput(*mnt);
+ *mnt = um->u_this.mnt;
+ res = 1;
+ }
+
+ return res;
+}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7648b49..4d48c20 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -101,6 +101,15 @@ struct dentry {
struct dentry *d_parent; /* parent directory */
struct qstr d_name;

+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * The following fields are used by the VFS based union mount
+ * implementation. Both are protected by union_lock!
+ */
+ struct list_head d_unions; /* list of union_mount's */
+ unsigned int d_unionized; /* unions referencing this dentry */
+#endif
+
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
diff --git a/include/linux/union.h b/include/linux/union.h
new file mode 100644
index 0000000..0c85312
--- /dev/null
+++ b/include/linux/union.h
@@ -0,0 +1,61 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/list.h>
+#include <asm/atomic.h>
+
+struct dentry;
+struct vfsmount;
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * The new union mount structure.
+ */
+struct union_mount {
+ atomic_t u_count; /* reference count */
+ struct mutex u_mutex;
+ struct list_head u_unions; /* list head for d_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */
+
+ struct path u_this; /* this is me */
+ struct path u_next; /* this is what I overlay */
+};
+
+#define IS_UNION(dentry) (!list_empty(&(dentry)->d_unions) || \
+ (dentry)->d_unionized)
+#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+
+extern int is_unionized(struct dentry *, struct vfsmount *);
+extern int append_to_union(struct vfsmount *, struct dentry *,
+ struct vfsmount *, struct dentry *);
+extern int follow_union_down(struct vfsmount **, struct dentry **);
+extern int follow_union_mount(struct vfsmount **, struct dentry **);
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define IS_UNION(x) (0)
+#define IS_MNT_UNION(x) (0)
+#define is_unionized(x, y) (0)
+#define append_to_union(x1, y1, x2, y2) ({ BUG(); (0); })
+#define follow_union_down(x, y) ({ (0); })
+#define follow_union_mount(x, y) ({ (0); })
+
+#endif /* CONFIG_UNION_MOUNT */
+#endif /* __KERNEL__ */
+#endif /* __LINUX_UNION_H */
--
1.6.3.3
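
The core idea of the patch above is that each union_mount maps one
(dentry, vfsmount) pair to the pair it overlays, and that lookups find
this mapping through a hash table. The following userspace sketch
models only the forward direction (union_lookup(), append_to_union(),
and follow_union_down()). All names in it (struct path_model, struct
um_model, bucket_of(), um_lookup(), um_append(), follow_down()) are
hypothetical, the hash function is a trivial stand-in for the one in
fs/union.c, and reference counting and union_lock are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8

/* Stand-in for struct path: opaque pointers used only as identities. */
struct path_model {
	const void *mnt;
	const void *dentry;
};

/* Stand-in for struct union_mount, forward hash chain only. */
struct um_model {
	struct path_model this_path;	/* "this is me" */
	struct path_model next_path;	/* "this is what I overlay" */
	struct um_model *hash_next;
};

static struct um_model *buckets[NBUCKETS];

/* Trivial hash over both pointers; not the hash used by the patch. */
static unsigned int bucket_of(const void *dentry, const void *mnt)
{
	uintptr_t v = (uintptr_t)dentry ^ ((uintptr_t)mnt * 31u);

	return (unsigned int)((v >> 4) % NBUCKETS);
}

/* Find the union entry whose upper side is (dentry, mnt), if any. */
static struct um_model *um_lookup(const void *dentry, const void *mnt)
{
	struct um_model *um;

	for (um = buckets[bucket_of(dentry, mnt)]; um; um = um->hash_next)
		if (um->this_path.dentry == dentry && um->this_path.mnt == mnt)
			return um;
	return NULL;
}

/* Hash a new entry. Returns 1 if an entry for this pair already exists. */
static int um_append(struct um_model *um)
{
	unsigned int b;

	if (um_lookup(um->this_path.dentry, um->this_path.mnt))
		return 1;
	b = bucket_of(um->this_path.dentry, um->this_path.mnt);
	um->hash_next = buckets[b];
	buckets[b] = um;
	return 0;
}

/* Step one layer down. Returns non-zero if p was changed. */
static int follow_down(struct path_model *p)
{
	struct um_model *um = um_lookup(p->dentry, p->mnt);

	if (!um)
		return 0;
	*p = um->next_path;
	return 1;
}
```

Because the key is the pair rather than the dentry alone, the same
dentry can take part in more than one union when it is reachable
through different mounts, which is the bind mount case listed under
the advantages above.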

2009-10-21 19:21:03

by Valerie Aurora

Subject: [PATCH 21/41] union-mount: Drive the union cache via dcache

From: Jan Blunck <[email protected]>

If a dentry is removed from dentry cache because its usage count drops to
zero, the references to the underlying layer of the unions the dentry is in
are dropped too. Therefore the union cache is driven by the dentry cache.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 10 ++++++-
fs/union.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 8 +++++
include/linux/union.h | 6 ++++
4 files changed, 97 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 56bd05f..d80a3bb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -18,6 +18,7 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/union.h>
#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -188,11 +189,14 @@ static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
list_add(&dentry->d_lru, list);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
+ __shrink_d_unions(dentry, list);
return NULL;
}

- /*drops the locks, at that point nobody can reach this dentry */
+ /* drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
+ /* If the dentry was in an union delete them */
+ __shrink_d_unions(dentry, list);
if (IS_ROOT(dentry))
parent = NULL;
else
@@ -784,6 +788,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
iput(inode);
}

+ shrink_d_unions(dentry);
d_free(dentry);

/* finished when we fall off the top of the tree,
@@ -1614,7 +1619,9 @@ void d_delete(struct dentry * dentry)
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (atomic_read(&dentry->d_count) == 1) {
+ __d_drop_unions(dentry);
dentry_iput(dentry);
+ shrink_d_unions(dentry);
fsnotify_nameremove(dentry, isdir);
return;
}
@@ -1625,6 +1632,7 @@ void d_delete(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

+ shrink_d_unions(dentry);
fsnotify_nameremove(dentry, isdir);
}

diff --git a/fs/union.c b/fs/union.c
index d1950c2..6b99393 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -14,6 +14,7 @@

#include <linux/bootmem.h>
#include <linux/init.h>
+#include <linux/module.h>
#include <linux/types.h>
#include <linux/hash.h>
#include <linux/fs.h>
@@ -255,6 +256,8 @@ int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
union_put(this);
return 0;
}
+ list_add(&this->u_unions, &dentry->d_unions);
+ dest_dentry->d_unionized++;
__union_hash(this);
spin_unlock(&union_lock);
return 0;
@@ -330,3 +333,74 @@ int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)

return res;
}
+
+/*
+ * This must be called when unhashing a dentry. This is called with dcache_lock
+ * and unhashes all unions this dentry is in.
+ */
+void __d_drop_unions(struct dentry *dentry)
+{
+ struct union_mount *this, *next;
+
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions)
+ __union_unhash(this);
+ spin_unlock(&union_lock);
+}
+EXPORT_SYMBOL_GPL(__d_drop_unions);
+
+/*
+ * This must be called after __d_drop_unions() without holding any locks.
+ * Note: The dentry might still be reachable via a lookup but at that time it
+ * is already a negative dentry. Otherwise it would be unhashed. The union_mount
+ * structure itself is still reachable through mnt->mnt_unions (which we
+ * protect against with union_lock).
+ */
+void shrink_d_unions(struct dentry *dentry)
+{
+ struct union_mount *this, *next;
+
+repeat:
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+ BUG_ON(!hlist_unhashed(&this->u_hash));
+ BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_unions);
+ this->u_next.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(this);
+ goto repeat;
+ }
+ spin_unlock(&union_lock);
+}
+
+extern void __dput(struct dentry *, struct list_head *, int);
+
+/*
+ * This is the special variant for use in dput() only.
+ */
+void __shrink_d_unions(struct dentry *dentry, struct list_head *list)
+{
+ struct union_mount *this, *next;
+
+ BUG_ON(!d_unhashed(dentry));
+
+repeat:
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+ struct dentry *n_dentry = this->u_next.dentry;
+ struct vfsmount *n_mnt = this->u_next.mnt;
+
+ BUG_ON(!hlist_unhashed(&this->u_hash));
+ BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_unions);
+ this->u_next.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ if (__union_put(this)) {
+ __dput(n_dentry, list, 0);
+ mntput(n_mnt);
+ }
+ goto repeat;
+ }
+ spin_unlock(&union_lock);
+}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4d48c20..730c432 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -215,12 +215,20 @@ extern seqlock_t rename_lock;
* __d_drop requires dentry->d_lock.
*/

+#ifdef CONFIG_UNION_MOUNT
+extern void __d_drop_unions(struct dentry *);
+#endif
+
static inline void __d_drop(struct dentry *dentry)
{
if (!(dentry->d_flags & DCACHE_UNHASHED)) {
dentry->d_flags |= DCACHE_UNHASHED;
hlist_del_rcu(&dentry->d_hash);
}
+#ifdef CONFIG_UNION_MOUNT
+ /* remove dentry from the union hashtable */
+ __d_drop_unions(dentry);
+#endif
}

static inline void d_drop(struct dentry *dentry)
diff --git a/include/linux/union.h b/include/linux/union.h
index 0c85312..b035a82 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -46,6 +46,9 @@ extern int append_to_union(struct vfsmount *, struct dentry *,
struct vfsmount *, struct dentry *);
extern int follow_union_down(struct vfsmount **, struct dentry **);
extern int follow_union_mount(struct vfsmount **, struct dentry **);
+extern void __d_drop_unions(struct dentry *);
+extern void shrink_d_unions(struct dentry *);
+extern void __shrink_d_unions(struct dentry *, struct list_head *);

#else /* CONFIG_UNION_MOUNT */

@@ -55,6 +58,9 @@ extern int follow_union_mount(struct vfsmount **, struct dentry **);
#define append_to_union(x1, y1, x2, y2) ({ BUG(); (0); })
#define follow_union_down(x, y) ({ (0); })
#define follow_union_mount(x, y) ({ (0); })
+#define __d_drop_unions(x) do { } while (0)
+#define shrink_d_unions(x) do { } while (0)
+#define __shrink_d_unions(x,y) do { } while (0)

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2009-10-21 19:26:00

by Valerie Aurora

Subject: [PATCH 22/41] union-mount: Some checks during namespace changes

From: Jan Blunck <[email protected]>

Add some additional checks when mounting something into a union.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 81b3188..dc01385 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -29,6 +29,7 @@
#include <linux/log2.h>
#include <linux/idr.h>
#include <linux/fs_struct.h>
+#include <linux/union.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include "pnode.h"
@@ -1427,6 +1428,10 @@ static int do_change_type(struct path *path, int flag)
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;

+ /* Don't change the type of union mounts */
+ if (IS_MNT_UNION(path->mnt))
+ return -EINVAL;
+
down_write(&namespace_sem);
if (type == MS_SHARED) {
err = invent_group_ids(mnt, recurse);
@@ -1478,6 +1483,18 @@ static int do_loopback(struct path *path, char *old_name, int recurse,
if (!mnt)
goto out;

+ /*
+ * Unions couldn't be writable if the filesystem doesn't know about
+ * whiteouts
+ */
+ err = -ENOTSUPP;
+ if ((mnt_flags & MNT_UNION) &&
+ !(mnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+ goto out;
+
+ if (mnt_flags & MNT_UNION)
+ mnt->mnt_flags |= MNT_UNION;
+
err = graft_tree(mnt, path);
if (err) {
LIST_HEAD(umount_list);
@@ -1571,6 +1588,13 @@ static int do_move_mount(struct path *path, char *old_name)
if (err)
return err;

+ /* moving to or from a union mount is not supported */
+ err = -EINVAL;
+ if (IS_MNT_UNION(path->mnt))
+ goto exit;
+ if (IS_MNT_UNION(old_path.mnt))
+ goto exit;
+
down_write(&namespace_sem);
while (d_mountpoint(path->dentry) &&
follow_down(path))
@@ -1628,6 +1652,7 @@ out:
up_write(&namespace_sem);
if (!err)
path_put(&parent_path);
+exit:
path_put(&old_path);
return err;
}
@@ -1685,6 +1710,15 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ /*
+ * Unions couldn't be writable if the filesystem doesn't know about
+ * whiteouts
+ */
+ err = -ENOTSUPP;
+ if ((mnt_flags & MNT_UNION) &&
+ !(newmnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+ goto unlock;
+
newmnt->mnt_flags = mnt_flags;
if ((err = graft_tree(newmnt, path)))
goto unlock;
--
1.6.3.3

2009-10-21 19:26:41

by Valerie Aurora

Subject: [PATCH 23/41] union-mount: Changes to the namespace handling

From: Jan Blunck <[email protected]>

Create the proper struct union_mount when mounting something into a
union. If the topmost filesystem isn't capable of handling the white-out
filetype, it can only be mounted read-only.
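
For reviewers who want the rule in isolation, here is a minimal user-space
sketch of the writability check. The SKETCH_* flag values and the function
name check_union_mount() are assumptions made up for this illustration, and
-EOPNOTSUPP stands in for the kernel-internal -ENOTSUPP, which user-space
errno.h does not define.

```c
#include <errno.h>

/* Hypothetical bit values for illustration only; not the kernel's. */
#define SKETCH_MNT_UNION   0x1u	/* mount was requested as a union */
#define SKETCH_SB_RDONLY   0x1u	/* superblock is read-only */
#define SKETCH_SB_WHITEOUT 0x2u	/* filesystem supports whiteouts */

/*
 * A union mount is accepted if the top filesystem either supports
 * whiteouts or is read-only. Non-union mounts are never rejected here.
 */
static int check_union_mount(unsigned int mnt_flags, unsigned int sb_flags)
{
	if ((mnt_flags & SKETCH_MNT_UNION) &&
	    !(sb_flags & (SKETCH_SB_WHITEOUT | SKETCH_SB_RDONLY)))
		return -EOPNOTSUPP;	/* stand-in for -ENOTSUPP */
	return 0;
}
```

The single mask test covers both acceptable cases, so a read-only top layer
passes even without whiteout support.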

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 7 ++++++
fs/union.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mount.h | 3 ++
include/linux/union.h | 10 +++++++-
4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index dc01385..0280e5b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -158,6 +158,9 @@ struct vfsmount *alloc_vfsmnt(const char *name)
#else
mnt->mnt_writers = 0;
#endif
+#ifdef CONFIG_UNION_MOUNT
+ INIT_LIST_HEAD(&mnt->mnt_unions);
+#endif
}
return mnt;

@@ -470,6 +473,7 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)

static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
{
+ detach_mnt_union(mnt);
old_path->dentry = mnt->mnt_mountpoint;
old_path->mnt = mnt->mnt_parent;
mnt->mnt_parent = mnt;
@@ -493,6 +497,7 @@ static void attach_mnt(struct vfsmount *mnt, struct path *path)
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(path->mnt, path->dentry));
list_add_tail(&mnt->mnt_child, &path->mnt->mnt_mounts);
+ attach_mnt_union(mnt, path->mnt, path->dentry);
}

/*
@@ -515,6 +520,7 @@ static void commit_tree(struct vfsmount *mnt)
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(parent, mnt->mnt_mountpoint));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+ attach_mnt_union(mnt, mnt->mnt_parent, mnt->mnt_mountpoint);
touch_mnt_namespace(n);
}

@@ -986,6 +992,7 @@ void release_mounts(struct list_head *head)
struct dentry *dentry;
struct vfsmount *m;
spin_lock(&vfsmount_lock);
+ detach_mnt_union(mnt);
dentry = mnt->mnt_mountpoint;
m = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
diff --git a/fs/union.c b/fs/union.c
index 6b99393..341fc03 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -113,6 +113,7 @@ struct union_mount *union_alloc(struct dentry *this, struct vfsmount *this_mnt,

atomic_set(&um->u_count, 1);
INIT_LIST_HEAD(&um->u_unions);
+ INIT_LIST_HEAD(&um->u_list);
INIT_HLIST_NODE(&um->u_hash);
INIT_HLIST_NODE(&um->u_rhash);

@@ -256,6 +257,7 @@ int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
union_put(this);
return 0;
}
+ list_add(&this->u_list, &mnt->mnt_unions);
list_add(&this->u_unions, &dentry->d_unions);
dest_dentry->d_unionized++;
__union_hash(this);
@@ -365,6 +367,7 @@ repeat:
list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
BUG_ON(!hlist_unhashed(&this->u_hash));
BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_list);
list_del(&this->u_unions);
this->u_next.dentry->d_unionized--;
spin_unlock(&union_lock);
@@ -393,6 +396,7 @@ repeat:

BUG_ON(!hlist_unhashed(&this->u_hash));
BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_list);
list_del(&this->u_unions);
this->u_next.dentry->d_unionized--;
spin_unlock(&union_lock);
@@ -404,3 +408,56 @@ repeat:
}
spin_unlock(&union_lock);
}
+
+/*
+ * Remove all union_mounts structures belonging to this vfsmount from the
+ * union lookup hashtable and so on ...
+ */
+void shrink_mnt_unions(struct vfsmount *mnt)
+{
+ struct union_mount *this, *next;
+
+repeat:
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &mnt->mnt_unions, u_list) {
+ if (this->u_this.dentry == mnt->mnt_root)
+ continue;
+ __union_unhash(this);
+ list_del(&this->u_list);
+ list_del(&this->u_unions);
+ this->u_next.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(this);
+ goto repeat;
+ }
+ spin_unlock(&union_lock);
+}
+
+int attach_mnt_union(struct vfsmount *mnt, struct vfsmount *dest_mnt,
+ struct dentry *dest_dentry)
+{
+ if (!IS_MNT_UNION(mnt))
+ return 0;
+
+ return append_to_union(mnt, mnt->mnt_root, dest_mnt, dest_dentry);
+}
+
+void detach_mnt_union(struct vfsmount *mnt)
+{
+ struct union_mount *um;
+
+ if (!IS_MNT_UNION(mnt))
+ return;
+
+ shrink_mnt_unions(mnt);
+
+ spin_lock(&union_lock);
+ um = union_lookup(mnt->mnt_root, mnt);
+ __union_unhash(um);
+ list_del(&um->u_list);
+ list_del(&um->u_unions);
+ um->u_next.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(um);
+ return;
+}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index e175c47..70c4f1f 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -54,6 +54,9 @@ struct vfsmount {
struct list_head mnt_slave_list;/* list of slave mounts */
struct list_head mnt_slave; /* slave list entry */
struct vfsmount *mnt_master; /* slave is on master->mnt_slave_list */
+#ifdef CONFIG_UNION_MOUNT
+ struct list_head mnt_unions; /* list of union_mount structures */
+#endif
struct mnt_namespace *mnt_ns; /* containing namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
diff --git a/include/linux/union.h b/include/linux/union.h
index b035a82..0b6f356 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -30,8 +30,9 @@ struct union_mount {
atomic_t u_count; /* reference count */
struct mutex u_mutex;
struct list_head u_unions; /* list head for d_unions */
- struct hlist_node u_hash; /* list head for searching */
- struct hlist_node u_rhash; /* list head for reverse searching */
+ struct list_head u_list; /* list head for mnt_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */

struct path u_this; /* this is me */
struct path u_next; /* this is what I overlay */
@@ -49,6 +50,9 @@ extern int follow_union_mount(struct vfsmount **, struct dentry **);
extern void __d_drop_unions(struct dentry *);
extern void shrink_d_unions(struct dentry *);
extern void __shrink_d_unions(struct dentry *, struct list_head *);
+extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
+ struct dentry *);
+extern void detach_mnt_union(struct vfsmount *);

#else /* CONFIG_UNION_MOUNT */

@@ -61,6 +65,8 @@ extern void __shrink_d_unions(struct dentry *, struct list_head *);
#define __d_drop_unions(x) do { } while (0)
#define shrink_d_unions(x) do { } while (0)
#define __shrink_d_unions(x,y) do { } while (0)
+#define attach_mnt_union(x, y, z) do { } while (0)
+#define detach_mnt_union(x) do { } while (0)

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2009-10-21 19:26:16

by Valerie Aurora

Subject: [PATCH 24/41] union-mount: Make lookup work for union-mounted file systems

From: Jan Blunck <[email protected]>

On union-mounted file systems the lookup function must also visit lower layers
of the union-stack when doing a lookup. This patch adds support for
union-mounts to cached lookups and real lookups.
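
As a rough illustration of "visit lower layers", here is a user-space model
of a topmost lookup. It is only a sketch: struct layer, struct entry,
layer_lookup() and union_lookup_topmost() are hypothetical names invented
for this example, and a NULL result plays the role of a negative dentry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; none of these types exist in the kernel. */
enum entry_kind { ENT_FILE, ENT_DIR };

struct entry {
	const char *name;
	enum entry_kind kind;
};

struct layer {
	const struct entry *entries;
	size_t nr_entries;
	const struct layer *lower;	/* next layer down, NULL at the bottom */
};

/* Look a name up in one layer only; NULL models a negative dentry. */
static const struct entry *layer_lookup(const struct layer *l,
					const char *name)
{
	for (size_t i = 0; i < l->nr_entries; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return &l->entries[i];
	return NULL;
}

/*
 * Walk from the top layer downwards and return the first positive hit.
 * *found_in reports which layer supplied it (NULL if none did).
 */
static const struct entry *union_lookup_topmost(const struct layer *top,
						const char *name,
						const struct layer **found_in)
{
	for (const struct layer *l = top; l; l = l->lower) {
		const struct entry *e = layer_lookup(l, name);

		if (e) {
			*found_in = l;
			return e;
		}
	}
	*found_in = NULL;
	return NULL;
}
```

The model ignores caching, revalidation and locking; it only shows the order
in which layers are consulted.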

We have 3 different styles of lookup functions now:
- multiple pathname components, follows mounts, follows unions, follows
symlinks
- single pathname component, doesn't follow mounts, follows unions, doesn't
follow symlinks
- single pathname component, doesn't follow mounts, doesn't follow unions,
doesn't follow symlinks

XXX - Needs to be re-organized to reduce code duplication. But how?

- Create shared lookup_topmost() and build_union() functions that take
flags or function pointers for real_lookup(), cache_lookup(), etc.
- Push union code farther down into cache_lookup(), etc.
- (your idea here)

XXX - Symlinks to other file systems (and probably submounts) don't
work - see comment in do_lookup().

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 483 ++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/namei.h | 6 +
2 files changed, 481 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 408380d..b279686 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -33,6 +33,7 @@
#include <linux/fcntl.h>
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
+#include <linux/union.h>
#include <asm/uaccess.h>

#define ACC_MODE(x) ("\000\004\002\006"[(x)&O_ACCMODE])
@@ -415,6 +416,173 @@ static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
return dentry;
}

+/**
+ * __cache_lookup_topmost - lookup the topmost (non-)negative dentry
+ *
+ * @nd - parent's nameidata
+ * @name - pathname part to lookup
+ * @path - found dentry for pathname part
+ *
+ * This is used for union mount lookups from dcache. The first non-negative
+ * dentry is searched on all layers of the union stack. Otherwise the topmost
+ * negative dentry is returned.
+ */
+static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct dentry *dentry;
+
+ dentry = d_lookup(nd->path.dentry, name);
+ if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
+ dentry = do_revalidate(dentry, nd);
+
+ /*
+ * Remember the topmost negative dentry in case we don't find anything
+ */
+ path->dentry = dentry;
+ path->mnt = dentry ? nd->path.mnt : NULL;
+
+ if (!dentry || dentry->d_inode)
+ return !dentry;
+
+ /* look for the first non-negative dentry */
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ dentry = d_hash_and_lookup(nd->path.dentry, name);
+
+ /*
+ * If parts of the union stack are not in the dcache we need
+ * to do a real lookup
+ */
+ if (!dentry)
+ goto out_dput;
+
+ /*
+ * If parts of the union don't survive the revalidation we
+ * need to do a real lookup
+ */
+ if (dentry->d_op && dentry->d_op->d_revalidate) {
+ dentry = do_revalidate(dentry, nd);
+ if (!dentry)
+ goto out_dput;
+ }
+
+ if (dentry->d_inode)
+ goto out_dput;
+
+ dput(dentry);
+ }
+
+ return !dentry;
+
+out_dput:
+ dput(path->dentry);
+ path->dentry = dentry;
+ path->mnt = dentry ? mntget(nd->path.mnt) : NULL;
+ return !dentry;
+}
+
+/**
+ * __cache_lookup_build_union - build the union stack for this part,
+ * cached version
+ *
+ * This is called after you have the topmost dentry in @path.
+ */
+static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path last = *path;
+ struct dentry *dentry;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ dentry = d_hash_and_lookup(nd->path.dentry, name);
+ if (!dentry)
+ return 1;
+
+ if (dentry->d_op && dentry->d_op->d_revalidate) {
+ dentry = do_revalidate(dentry, nd);
+ if (!dentry)
+ return 1;
+ }
+
+ if (!dentry->d_inode) {
+ dput(dentry);
+ continue;
+ }
+
+ /* only directories can be part of a union stack */
+ if (!S_ISDIR(dentry->d_inode->i_mode)) {
+ dput(dentry);
+ break;
+ }
+
+ /* Add the newly discovered dir to the union stack */
+ append_to_union(last.mnt, last.dentry, nd->path.mnt, dentry);
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+ last.dentry = dentry;
+ last.mnt = mntget(nd->path.mnt);
+ }
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+
+ return 0;
+}
+
+/**
+ * cache_lookup_union - lookup a single pathname part from dcache
+ *
+ * This is a union mount capable version of what d_lookup() & revalidate()
+ * would do. This function returns a valid (union) dentry on success.
+ *
+ * Remember: On failure it means that parts of the union aren't cached. You
+ * should call real_lookup() afterwards to find the proper (union) dentry.
+ */
+static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ int res ;
+
+ if (!IS_MNT_UNION(nd->path.mnt)) {
+ path->dentry = cache_lookup(nd->path.dentry, name, nd);
+ path->mnt = path->dentry ? nd->path.mnt : NULL;
+ res = path->dentry ? 0 : 1;
+ } else {
+ struct path safe = {
+ .dentry = nd->path.dentry,
+ .mnt = nd->path.mnt
+ };
+
+ path_get(&safe);
+ res = __cache_lookup_topmost(nd, name, path);
+ if (res)
+ goto out;
+
+ /* only directories can be part of a union stack */
+ if (!path->dentry->d_inode ||
+ !S_ISDIR(path->dentry->d_inode->i_mode))
+ goto out;
+
+ /* Build the union stack for this part */
+ res = __cache_lookup_build_union(nd, name, path);
+ if (res) {
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ goto out;
+ }
+
+out:
+ path_put(&nd->path);
+ nd->path.dentry = safe.dentry;
+ nd->path.mnt = safe.mnt;
+ }
+
+ return res;
+}
+
/*
* Short-cut version of permission(), for calling by
* path_walk(), when dcache lock is held. Combines parts
@@ -536,6 +704,146 @@ out_unlock:
return res;
}

+/**
+ * __real_lookup_topmost - lookup topmost dentry, non-cached version
+ *
+ * If we reach a dentry with restricted access, we just stop the lookup
+ * because we shouldn't see through that dentry. Same thing for dentry
+ * type mismatch and whiteouts.
+ *
+ * FIXME:
+ * - handle DT_WHT
+ * - handle union stacks in use
+ * - handle union stacks mounted upon union stacks
+ * - avoid unnecessary allocations of union locks
+ */
+static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path next;
+ int err;
+
+ err = real_lookup(nd, name, path);
+ if (err)
+ return err;
+
+ if (path->dentry->d_inode)
+ return 0;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ name->hash = full_name_hash(name->name, name->len);
+ if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
+ err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+ name);
+ if (err < 0)
+ goto out;
+ }
+
+ err = real_lookup(nd, name, &next);
+ if (err)
+ goto out;
+
+ if (next.dentry->d_inode) {
+ dput(path->dentry);
+ mntget(next.mnt);
+ *path = next;
+ goto out;
+ }
+
+ dput(next.dentry);
+ }
+out:
+ if (err)
+ dput(path->dentry);
+ return err;
+}
+
+/**
+ * __real_lookup_build_union: build the union stack for this pathname
+ * part, non-cached version
+ *
+ * Called when not all parts of the union stack are in cache
+ */
+
+static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path last = *path;
+ struct path next;
+ int err = 0;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ /* We need to recompute the hash for lower layer lookups */
+ name->hash = full_name_hash(name->name, name->len);
+ if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
+ err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+ name);
+ if (err < 0)
+ goto out;
+ }
+
+ err = real_lookup(nd, name, &next);
+ if (err)
+ goto out;
+
+ if (!next.dentry->d_inode) {
+ dput(next.dentry);
+ continue;
+ }
+
+ /* only directories can be part of a union stack */
+ if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
+ dput(next.dentry);
+ break;
+ }
+
+ /* now we know we found something "real" */
+ append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+ last.dentry = next.dentry;
+ last.mnt = mntget(next.mnt);
+ }
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+out:
+ return err;
+}
+
+static int real_lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
+ int res ;
+
+ path_get(&safe);
+ res = __real_lookup_topmost(nd, name, path);
+ if (res)
+ goto out;
+
+ /* only directories can be part of a union stack */
+ if (!path->dentry->d_inode ||
+ !S_ISDIR(path->dentry->d_inode->i_mode))
+ goto out;
+
+ /* Build the union stack for this part */
+ res = __real_lookup_build_union(nd, name, path);
+ if (res) {
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ goto out;
+ }
+
+out:
+ path_put(&nd->path);
+ nd->path.dentry = safe.dentry;
+ nd->path.mnt = safe.mnt;
+ return res;
+}
+
/*
* Wrapper to retry pathname resolution whenever the underlying
* file system returns an ESTALE.
@@ -790,6 +1098,7 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
nd->path.mnt = parent;
}
follow_mount(&nd->path);
+ follow_union_mount(&nd->path.mnt, &nd->path.dentry);
}

/*
@@ -802,6 +1111,9 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
{
int err;

+ if (IS_MNT_UNION(nd->path.mnt))
+ goto need_union_lookup;
+
path->dentry = __d_lookup(nd->path.dentry, name);
path->mnt = nd->path.mnt;
if (!path->dentry)
@@ -810,7 +1122,25 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
goto need_revalidate;

done:
- __follow_mount(path);
+ if (nd->path.mnt != path->mnt) {
+ /*
+ * XXX FIXME: We only want to set this flag if we
+ * crossed from the top layer to the bottom layer of a
+ * union mount. But nd->path.mnt != path->mnt is also
+ * true when we cross from the top layer of a union
+ * mount to another file system, either by symlink or
+ * file system mounted on a directory in the union
+ * mount (probably - haven't tested).
+ *
+ * This might be an issue for every mnt/mnt comparison
+ * - or maybe just during the brief window between
+ * do_lookup() and do_follow_link() or follow_mount().
+ */
+ nd->um_flags |= LAST_LOWLEVEL;
+ follow_mount(path);
+ } else
+ __follow_mount(path);
+ follow_union_mount(&path->mnt, &path->dentry);
return 0;

need_lookup:
@@ -819,6 +1149,16 @@ need_lookup:
goto fail;
goto done;

+need_union_lookup:
+ err = cache_lookup_union(nd, name, path);
+ if (!err && path->dentry)
+ goto done;
+
+ err = real_lookup_union(nd, name, path);
+ if (err)
+ goto fail;
+ goto done;
+
need_revalidate:
path->dentry = do_revalidate(path->dentry, nd);
if (!path->dentry)
@@ -857,6 +1197,8 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
if (nd->depth)
lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);

+ follow_union_mount(&nd->path.mnt, &nd->path.dentry);
+
/* At this point we know we have a real path component. */
for(;;) {
unsigned long hash;
@@ -1041,6 +1383,7 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei

nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags;
+ nd->um_flags = 0;
nd->depth = 0;
nd->root.mnt = NULL;

@@ -1249,6 +1592,130 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
return err;
}

+static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path next;
+ int err;
+
+ err = lookup_hash(nd, name, path);
+ if (err)
+ return err;
+
+ if (path->dentry->d_inode)
+ return 0;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ name->hash = full_name_hash(name->name, name->len);
+ if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
+ err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+ name);
+ if (err < 0)
+ goto out;
+ }
+
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ err = lookup_hash(nd, name, &next);
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+ if (err)
+ goto out;
+
+ if (next.dentry->d_inode) {
+ dput(path->dentry);
+ mntget(next.mnt);
+ *path = next;
+ goto out;
+ }
+
+ dput(next.dentry);
+ }
+out:
+ if (err)
+ dput(path->dentry);
+ return err;
+}
+
+static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path last = *path;
+ struct path next;
+ int err = 0;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ /* We need to recompute the hash for lower layer lookups */
+ name->hash = full_name_hash(name->name, name->len);
+ if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
+ err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+ name);
+ if (err < 0)
+ goto out;
+ }
+
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ err = lookup_hash(nd, name, &next);
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+ if (err)
+ goto out;
+
+ if (!next.dentry->d_inode) {
+ dput(next.dentry);
+ continue;
+ }
+
+ /* only directories can be part of a union stack */
+ if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
+ dput(next.dentry);
+ break;
+ }
+
+ /* now we know we found something "real" */
+ append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+ last.dentry = next.dentry;
+ last.mnt = mntget(next.mnt);
+ }
+
+ if (last.dentry != path->dentry)
+ path_put(&last);
+out:
+ return err;
+}
+
+static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
+ int res ;
+
+ path_get(&safe);
+ res = __hash_lookup_topmost(nd, name, path);
+ if (res)
+ goto out;
+
+ /* only directories can be part of a union stack */
+ if (!path->dentry->d_inode ||
+ !S_ISDIR(path->dentry->d_inode->i_mode))
+ goto out;
+
+ /* Build the union stack for this part */
+ res = __hash_lookup_build_union(nd, name, path);
+ if (res) {
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ goto out;
+ }
+
+out:
+ path_put(&nd->path);
+ nd->path.dentry = safe.dentry;
+ nd->path.mnt = safe.mnt;
+ return res;
+}
+
static int __lookup_one_len(const char *name, struct qstr *this,
struct dentry *base, int len)
{
@@ -1756,7 +2223,7 @@ struct file *do_filp_open(int dfd, const char *pathname,
if (flag & O_EXCL)
nd.flags |= LOOKUP_EXCL;
mutex_lock(&dir->d_inode->i_mutex);
- error = lookup_hash(&nd, &nd.last, &path);
+ error = hash_lookup_union(&nd, &nd.last, &path);

do_last:
if (error) {
@@ -1920,7 +2387,7 @@ do_link:
}
dir = nd.path.dentry;
mutex_lock(&dir->d_inode->i_mutex);
- error = lookup_hash(&nd, &nd.last, &path);
+ error = hash_lookup_union(&nd, &nd.last, &path);
__putname(nd.last.name);
goto do_last;
}
@@ -1971,7 +2438,7 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
/*
* Do the final lookup.
*/
- err = lookup_hash(nd, &nd->last, &path);
+ err = hash_lookup_union(nd, &nd->last, &path);
if (err) {
path.dentry = ERR_PTR(err);
goto fail;
@@ -2467,7 +2934,7 @@ static long do_rmdir(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- error = lookup_hash(&nd, &nd.last, &path);
+ error = hash_lookup_union(&nd, &nd.last, &path);
if (error)
goto exit2;
error = mnt_want_write(nd.path.mnt);
@@ -2550,7 +3017,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- error = lookup_hash(&nd, &nd.last, &path);
+ error = hash_lookup_union(&nd, &nd.last, &path);
if (!error) {
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
@@ -2954,7 +3421,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,

trap = lock_rename(new_dir, old_dir);

- error = lookup_hash(&oldnd, &oldnd.last, &old);
+ error = hash_lookup_union(&oldnd, &oldnd.last, &old);
if (error)
goto exit3;
/* source must exist */
@@ -2973,7 +3440,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
- error = lookup_hash(&newnd, &newnd.last, &new);
+ error = hash_lookup_union(&newnd, &newnd.last, &new);
if (error)
goto exit4;
/* target should not be an ancestor of source */
diff --git a/include/linux/namei.h b/include/linux/namei.h
index d870ae2..81afb59 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -20,6 +20,7 @@ struct nameidata {
struct qstr last;
struct path root;
unsigned int flags;
+ unsigned int um_flags;
int last_type;
unsigned depth;
char *saved_names[MAX_NESTED_LINKS + 1];
@@ -35,6 +36,9 @@ struct nameidata {
*/
enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};

+#define LAST_UNION 0x01
+#define LAST_LOWLEVEL 0x02
+
/*
* The bitmask for a lookup event:
* - follow links at the end
@@ -49,6 +53,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_CONTINUE 4
#define LOOKUP_PARENT 16
#define LOOKUP_REVAL 64
+#define LOOKUP_TOPMOST 128
+
/*
* Intent data
*/
--
1.6.3.3

2009-10-21 19:21:17

by Valerie Aurora

Subject: [PATCH 25/41] union-mount: stop lookup when directory has S_OPAQUE flag set

From: Jan Blunck <[email protected]>

Honor the S_OPAQUE flag in the union path lookup.
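
To make the intended effect concrete, here is a small user-space sketch of
how deep a directory's union stack gets once opaque directories are honored.
struct dir_layer and union_stack_depth() are hypothetical names for this
illustration; layers are listed from the top down.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-layer view of one name; not a kernel structure. */
struct dir_layer {
	bool present;	/* the name exists in this layer */
	bool is_dir;	/* ... and it is a directory */
	bool opaque;	/* ... and the directory is marked opaque */
};

/*
 * Count how many layers contribute to the merged directory.
 * layers[0] is the topmost layer.
 */
static size_t union_stack_depth(const struct dir_layer *layers, size_t nr)
{
	size_t depth = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!layers[i].present)
			continue;	/* negative: keep looking below */
		if (!layers[i].is_dir)
			break;		/* only directories can be stacked */
		depth++;
		if (layers[i].opaque)
			break;		/* opaque hides everything below */
	}
	return depth;
}
```

An opaque directory still counts as part of the stack itself; it only cuts
off the layers beneath it.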

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b279686..8ebbf4f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -523,6 +523,9 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = dentry;
last.mnt = mntget(nd->path.mnt);
+
+ if (IS_OPAQUE(last.dentry->d_inode))
+ break;
}

if (last.dentry != path->dentry)
@@ -562,7 +565,8 @@ static int cache_lookup_union(struct nameidata *nd, struct qstr *name,

/* only directories can be part of a union stack */
if (!path->dentry->d_inode ||
- !S_ISDIR(path->dentry->d_inode->i_mode))
+ !S_ISDIR(path->dentry->d_inode->i_mode) ||
+ IS_OPAQUE(path->dentry->d_inode))
goto out;

/* Build the union stack for this part */
@@ -804,6 +808,9 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = next.dentry;
last.mnt = mntget(next.mnt);
+
+ if (IS_OPAQUE(last.dentry->d_inode))
+ break;
}

if (last.dentry != path->dentry)
@@ -825,7 +832,8 @@ static int real_lookup_union(struct nameidata *nd, struct qstr *name,

/* only directories can be part of a union stack */
if (!path->dentry->d_inode ||
- !S_ISDIR(path->dentry->d_inode->i_mode))
+ !S_ISDIR(path->dentry->d_inode->i_mode) ||
+ IS_OPAQUE(path->dentry->d_inode))
goto out;

/* Build the union stack for this part */
@@ -1111,7 +1119,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
{
int err;

- if (IS_MNT_UNION(nd->path.mnt))
+ if (IS_MNT_UNION(nd->path.mnt) && !IS_OPAQUE(nd->path.dentry->d_inode))
goto need_union_lookup;

path->dentry = __d_lookup(nd->path.dentry, name);
@@ -1676,6 +1684,9 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = next.dentry;
last.mnt = mntget(next.mnt);
+
+ if (IS_OPAQUE(last.dentry->d_inode))
+ break;
}

if (last.dentry != path->dentry)
--
1.6.3.3

2009-10-21 19:21:13

by Valerie Aurora

Subject: [PATCH 26/41] union-mount: stop lookup when finding a whiteout

From: Jan Blunck <[email protected]>

Stop the lookup if we find a whiteout during union path lookup.
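
A minimal user-space sketch of the rule, assuming each layer reports one of
three states for the name. enum slot and union_resolve() are hypothetical
names for this illustration; the kernel code returns the whiteout dentry
itself, whereas this model simply reports "not visible".

```c
/* Hypothetical per-layer lookup result; not a kernel type. */
enum slot { SLOT_NEGATIVE, SLOT_POSITIVE, SLOT_WHITEOUT };

/*
 * layers[0] is the topmost layer. Returns the index of the layer that
 * provides the name, or -1 if the name is not visible in the union.
 */
static int union_resolve(const enum slot *layers, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (layers[i] == SLOT_WHITEOUT)
			return -1;	/* whiteout hides all lower layers */
		if (layers[i] == SLOT_POSITIVE)
			return i;	/* first real entry wins */
	}
	return -1;			/* negative on every layer */
}
```

A whiteout below a positive entry has no effect, because the walk never
reaches it.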

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8ebbf4f..fb463ac 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -442,10 +442,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
path->dentry = dentry;
path->mnt = dentry ? nd->path.mnt : NULL;

- if (!dentry || dentry->d_inode)
+ if (!dentry || (dentry->d_inode || d_is_whiteout(dentry)))
return !dentry;

- /* look for the first non-negative dentry */
+ /* look for the first non-negative or whiteout dentry */

while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
dentry = d_hash_and_lookup(nd->path.dentry, name);
@@ -467,7 +467,7 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
goto out_dput;
}

- if (dentry->d_inode)
+ if (dentry->d_inode || d_is_whiteout(dentry))
goto out_dput;

dput(dentry);
@@ -505,6 +505,11 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
return 1;
}

+ if (d_is_whiteout(dentry)) {
+ dput(dentry);
+ break;
+ }
+
if (!dentry->d_inode) {
dput(dentry);
continue;
@@ -716,7 +721,6 @@ out_unlock:
* type mismatch and whiteouts.
*
* FIXME:
- * - handle DT_WHT
* - handle union stacks in use
* - handle union stacks mounted upon union stacks
* - avoid unnecessary allocations of union locks
@@ -731,7 +735,7 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (err)
return err;

- if (path->dentry->d_inode)
+ if (path->dentry->d_inode || d_is_whiteout(path->dentry))
return 0;

while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
@@ -747,7 +751,7 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (err)
goto out;

- if (next.dentry->d_inode) {
+ if (next.dentry->d_inode || d_is_whiteout(next.dentry)) {
dput(path->dentry);
mntget(next.mnt);
*path = next;
@@ -790,6 +794,11 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
if (err)
goto out;

+ if (d_is_whiteout(next.dentry)) {
+ dput(next.dentry);
+ break;
+ }
+
if (!next.dentry->d_inode) {
dput(next.dentry);
continue;
@@ -1610,7 +1619,7 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (err)
return err;

- if (path->dentry->d_inode)
+ if (path->dentry->d_inode || d_is_whiteout(path->dentry))
return 0;

while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
@@ -1628,7 +1637,7 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (err)
goto out;

- if (next.dentry->d_inode) {
+ if (next.dentry->d_inode || d_is_whiteout(next.dentry)) {
dput(path->dentry);
mntget(next.mnt);
*path = next;
@@ -1666,6 +1675,11 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
if (err)
goto out;

+ if (d_is_whiteout(next.dentry)) {
+ dput(next.dentry);
+ break;
+ }
+
if (!next.dentry->d_inode) {
dput(next.dentry);
continue;
--
1.6.3.3

2009-10-21 19:21:21

by Valerie Aurora

Subject: [PATCH 27/41] union-mount: in-kernel file copy between union mounted filesystems

This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (thus
read-only) layer of the union stack, it is copied to the topmost union layer
first.

This patch uses do_splice_direct() to do the in-kernel file copy.
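
For readers new to copy-up, here is a rough userspace sketch of the idea
described above. It is a toy model, not the kernel code: struct toy_layer,
toy_find(), toy_create() and toy_open_write() are names made up for
illustration, and a plain memcpy() stands in for the splice-based copy.

```c
#include <assert.h>
#include <string.h>

#define MAX_FILES 8
#define NAME_LEN 32
#define DATA_LEN 64

/* Hypothetical toy layer: a flat table of named files. */
struct toy_file {
	char name[NAME_LEN];
	char data[DATA_LEN];
	int used;
};

struct toy_layer {
	struct toy_file files[MAX_FILES];
};

/* Return the file called @name in layer @l, or NULL if there is none. */
static struct toy_file *toy_find(struct toy_layer *l, const char *name)
{
	int i;

	for (i = 0; i < MAX_FILES; i++)
		if (l->files[i].used && !strcmp(l->files[i].name, name))
			return &l->files[i];
	return NULL;
}

/* Create an empty file called @name in layer @l; NULL if the layer is full. */
static struct toy_file *toy_create(struct toy_layer *l, const char *name)
{
	int i;

	for (i = 0; i < MAX_FILES; i++) {
		if (!l->files[i].used) {
			l->files[i].used = 1;
			strncpy(l->files[i].name, name, NAME_LEN - 1);
			l->files[i].name[NAME_LEN - 1] = '\0';
			l->files[i].data[0] = '\0';
			return &l->files[i];
		}
	}
	return NULL;
}

/*
 * Open @name for writing. If it only exists on the (read-only) lower
 * layer, copy it to the top layer first and hand back the top copy.
 */
static struct toy_file *toy_open_write(struct toy_layer *top,
				       struct toy_layer *lower,
				       const char *name)
{
	struct toy_file *f = toy_find(top, name);
	struct toy_file *low;

	if (f)
		return f;		/* already on the topmost layer */
	low = toy_find(lower, name);
	if (!low)
		return NULL;		/* no such file on any layer */
	f = toy_create(top, name);
	if (!f)
		return NULL;
	memcpy(f->data, low->data, DATA_LEN);	/* the "copy-up" */
	return f;
}
```

The point of the sketch is only that writes never touch the lower layer:
after the first open for writing, every later open finds the top copy.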

XXX - Optimize for non-union mounts in union mount enabled kernels
(esp. call to is_unionized() in do_filp_open()).

XXX - "flags" argument to union_copyup() is unused - bug? Leftover
code?

Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 64 +++++++++-
fs/union.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/union.h | 7 +
3 files changed, 383 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fb463ac..f7ef769 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1050,7 +1050,7 @@ static int __follow_mount(struct path *path)
return res;
}

-static void follow_mount(struct path *path)
+void follow_mount(struct path *path)
{
while (d_mountpoint(path->dentry)) {
struct vfsmount *mounted = lookup_mnt(path);
@@ -1284,6 +1284,21 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
if (err)
break;

+ if ((nd->flags & LOOKUP_TOPMOST) &&
+ (nd->um_flags & LAST_LOWLEVEL)) {
+ struct dentry *dentry;
+
+ dentry = union_create_topmost(nd, &this, &next);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out_dput;
+ }
+ path_put_conditional(&next, nd);
+ next.mnt = nd->path.mnt;
+ next.dentry = dentry;
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ }
+
err = -ENOENT;
inode = next.dentry->d_inode;
if (!inode)
@@ -1333,6 +1348,22 @@ last_component:
err = do_lookup(nd, &this, &next);
if (err)
break;
+
+ if ((nd->flags & LOOKUP_TOPMOST) &&
+ (nd->um_flags & LAST_LOWLEVEL)) {
+ struct dentry *dentry;
+
+ dentry = union_create_topmost(nd, &this, &next);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out_dput;
+ }
+ path_put_conditional(&next, nd);
+ next.mnt = nd->path.mnt;
+ next.dentry = dentry;
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ }
+
inode = next.dentry->d_inode;
if ((lookup_flags & LOOKUP_FOLLOW)
&& inode && inode->i_op->follow_link) {
@@ -1709,7 +1740,7 @@ out:
return err;
}

-static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
+int hash_lookup_union(struct nameidata *nd, struct qstr *name,
struct path *path)
{
struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
@@ -2208,6 +2239,12 @@ struct file *do_filp_open(int dfd, const char *pathname,
&nd, flag);
if (error)
return ERR_PTR(error);
+ if (unlikely(flag & FMODE_WRITE)) {
+ /* Check for union, etc. in union_copyup */
+ error = union_copyup(&nd, flag /* XXX not used */);
+ if (error)
+ return ERR_PTR(error);
+ }
goto ok;
}

@@ -2311,10 +2348,23 @@ do_last:
if (path.dentry->d_inode->i_op->follow_link)
goto do_link;

- path_to_nameidata(&path, &nd);
error = -EISDIR;
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
- goto exit;
+ goto exit_dput;
+
+ /*
+ * If this file is on a lower layer of the union stack, copy it to the
+ * topmost layer before opening it
+ */
+ if (path.dentry->d_inode &&
+ (path.dentry->d_parent != dir) &&
+ S_ISREG(path.dentry->d_inode->i_mode)) {
+ error = __union_copyup(&path, &nd, &path);
+ if (error)
+ goto exit_dput;
+ }
+
+ path_to_nameidata(&path, &nd);
ok:
/*
* Consider:
@@ -3472,6 +3522,12 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOTEMPTY;
if (new.dentry == trap)
goto exit5;
+ /* renaming on unions is done by the user-space */
+ error = -EXDEV;
+ if (is_unionized(oldnd.path.dentry, oldnd.path.mnt))
+ goto exit5;
+ if (is_unionized(newnd.path.dentry, newnd.path.mnt))
+ goto exit5;

error = mnt_want_write(oldnd.path.mnt);
if (error)
diff --git a/fs/union.c b/fs/union.c
index 341fc03..de31fc9 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -21,6 +21,14 @@
#include <linux/mount.h>
#include <linux/fs_struct.h>
#include <linux/union.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/quotaops.h>
+#include <linux/dnotify.h>
+#include <linux/security.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/splice.h>

/*
* This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -337,6 +345,314 @@ int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
}

/*
+ * Union mount copyup support
+ */
+
+extern int hash_lookup_union(struct nameidata *, struct qstr *, struct path *);
+extern void follow_mount(struct path *path);
+
+/*
+ * union_relookup_topmost - lookup and create the topmost path to dentry
+ * @nd: pointer to nameidata
+ * @flags: lookup flags
+ */
+static int union_relookup_topmost(struct nameidata *nd, int flags)
+{
+ int err;
+ char *kbuf, *name;
+ struct nameidata this;
+
+ kbuf = (char *)__get_free_page(GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ name = d_path(&nd->path, kbuf, PAGE_SIZE);
+ err = PTR_ERR(name);
+ if (IS_ERR(name))
+ goto free_page;
+
+ err = path_lookup(name, flags|LOOKUP_CREATE|LOOKUP_TOPMOST, &this);
+ if (err)
+ goto free_page;
+
+ path_put(&nd->path);
+ nd->path.dentry = this.path.dentry;
+ nd->path.mnt = this.path.mnt;
+
+ /*
+ * the nd->flags should be unchanged
+ */
+ BUG_ON(this.um_flags & LAST_LOWLEVEL);
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ free_page:
+ free_page((unsigned long)kbuf);
+ return err;
+}
+
+/*
+ * union_create_topmost - create the topmost path component
+ * @nd: pointer to nameidata of the base directory
+ * @name: pointer to file name
+ * @path: pointer to path of the overlaid file
+ *
+ * This is called by __link_path_walk() to create the directories on a path
+ * when it is called with LOOKUP_TOPMOST.
+ */
+struct dentry *union_create_topmost(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct dentry *dentry, *parent = nd->path.dentry;
+ int res, mode = path->dentry->d_inode->i_mode;
+
+ if (parent->d_sb == path->dentry->d_sb)
+ return ERR_PTR(-EEXIST);
+
+ res = mnt_want_write(nd->path.mnt);
+ if (res)
+ return ERR_PTR(res);
+
+ mutex_lock(&parent->d_inode->i_mutex);
+ dentry = lookup_one_len(name->name, nd->path.dentry, name->len);
+ if (IS_ERR(dentry))
+ goto out_unlock;
+
+ switch (mode & S_IFMT) {
+ case S_IFREG:
+ /*
+ * FIXME: Does this make any sense in this case?
+ * Special case - lookup gave negative, but... we had foo/bar/
+ * From the vfs_mknod() POV we just have a negative dentry -
+ * all is fine. Let's be bastards - you had / on the end, you've
+ * been asking for (non-existent) directory. -ENOENT for you.
+ */
+ if (name->name[name->len] && !dentry->d_inode) {
+ dput(dentry);
+ dentry = ERR_PTR(-ENOENT);
+ goto out_unlock;
+ }
+
+ res = vfs_create(parent->d_inode, dentry, mode, nd);
+ if (res) {
+ dput(dentry);
+ dentry = ERR_PTR(res);
+ goto out_unlock;
+ }
+ break;
+ case S_IFDIR:
+ res = vfs_mkdir(parent->d_inode, dentry, mode);
+ if (res) {
+ dput(dentry);
+ dentry = ERR_PTR(res);
+ goto out_unlock;
+ }
+
+ res = append_to_union(nd->path.mnt, dentry, path->mnt,
+ path->dentry);
+ if (res) {
+ dput(dentry);
+ dentry = ERR_PTR(res);
+ goto out_unlock;
+ }
+ break;
+ default:
+ dput(dentry);
+ dentry = ERR_PTR(-EINVAL);
+ goto out_unlock;
+ }
+
+ out_unlock:
+ mutex_unlock(&parent->d_inode->i_mutex);
+ mnt_drop_write(nd->path.mnt);
+ return dentry;
+}
+
+static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
+ struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+ int ret;
+ size_t size;
+ loff_t offset;
+ struct file *old_file, *new_file;
+ const struct cred *cred = current_cred();
+
+ dget(old_dentry);
+ mntget(old_mnt);
+ old_file = dentry_open(old_dentry, old_mnt, O_RDONLY, cred);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ dget(new_dentry);
+ mntget(new_mnt);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY, cred);
+ ret = PTR_ERR(new_file);
+ if (IS_ERR(new_file))
+ goto fput_old;
+
+ size = i_size_read(old_file->f_path.dentry->d_inode);
+ if (((size_t)size != size) || ((ssize_t)size != size)) {
+ ret = -EFBIG;
+ goto fput_new;
+ }
+
+ offset = 0;
+ ret = do_splice_direct(old_file, &offset, new_file, size,
+ SPLICE_F_MOVE);
+ if (ret >= 0)
+ ret = 0;
+ fput_new:
+ fput(new_file);
+ fput_old:
+ fput(old_file);
+ return ret;
+}
+
+/**
+ * __union_copyup - copy a file to the topmost directory
+ * @old: pointer to path of the old file name
+ * @new_nd: pointer to nameidata of the topmost directory
+ * @new: pointer to path of the new file name
+ *
+ * The topmost directory @new_nd must already be locked. Creates the topmost
+ * file if it doesn't exist yet.
+ */
+int __union_copyup(struct path *old, struct nameidata *new_nd, struct path *new)
+{
+ struct dentry *dentry;
+ int error;
+
+ /* Maybe this should be -EINVAL */
+ if (S_ISDIR(old->dentry->d_inode->i_mode))
+ return -EISDIR;
+
+ if (new_nd->path.dentry != new->dentry->d_parent) {
+ mutex_lock(&new_nd->path.dentry->d_inode->i_mutex);
+ dentry = lookup_one_len(new->dentry->d_name.name,
+ new_nd->path.dentry,
+ new->dentry->d_name.len);
+ mutex_unlock(&new_nd->path.dentry->d_inode->i_mutex);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ error = -EEXIST;
+ if (dentry->d_inode)
+ goto out_dput;
+ } else
+ dentry = dget(new->dentry);
+
+ error = mnt_want_write(new_nd->path.mnt);
+ if (error)
+ goto out_dput;
+
+ if (!dentry->d_inode) {
+ error = vfs_create(new_nd->path.dentry->d_inode, dentry,
+ old->dentry->d_inode->i_mode, new_nd);
+ if (error)
+ goto out_drop_write;
+ }
+
+ BUG_ON(!S_ISREG(old->dentry->d_inode->i_mode));
+ error = union_copy_file(old->dentry, old->mnt, dentry,
+ new_nd->path.mnt);
+ if (error) {
+ /* FIXME: are there return values we should not
+ * BUG() on ? */
+ BUG_ON(vfs_unlink(new_nd->path.dentry->d_inode,
+ dentry));
+ goto out_drop_write;
+ }
+
+ mnt_drop_write(new_nd->path.mnt);
+ dput(new->dentry);
+ new->dentry = dentry;
+ if (new->mnt != new_nd->path.mnt)
+ mntput(new->mnt);
+ new->mnt = new_nd->path.mnt;
+ return error;
+
+out_drop_write:
+ mnt_drop_write(new_nd->path.mnt);
+out_dput:
+ dput(dentry);
+ return error;
+}
+
+/*
+ * union_copyup - copy a file to the topmost layer of the union stack
+ * @nd: nameidata pointer to the file
+ * @flags: flags given to open_namei
+ */
+int union_copyup(struct nameidata *nd, int flags)
+{
+ struct qstr this;
+ char *name;
+ struct dentry *dir;
+ struct path path;
+ int err;
+
+ if (!is_unionized(nd->path.dentry, nd->path.mnt))
+ return 0;
+ if (!S_ISREG(nd->path.dentry->d_inode->i_mode))
+ return 0;
+
+ /* save the name for hash_lookup_union() */
+ this.len = nd->path.dentry->d_name.len;
+ this.hash = nd->path.dentry->d_name.hash;
+ name = kmalloc(this.len + 1, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+ this.name = name;
+ memcpy(name, nd->path.dentry->d_name.name, nd->path.dentry->d_name.len);
+ name[this.len] = 0;
+
+ err = union_relookup_topmost(nd, nd->flags|LOOKUP_PARENT);
+ if (err) {
+ kfree(name);
+ return err;
+ }
+ nd->flags &= ~LOOKUP_PARENT;
+
+ dir = nd->path.dentry;
+ mutex_lock(&dir->d_inode->i_mutex);
+ err = hash_lookup_union(nd, &this, &path);
+ mutex_unlock(&dir->d_inode->i_mutex);
+ kfree(name);
+ if (err)
+ return err;
+
+ err = -ENOENT;
+ if (!path.dentry->d_inode)
+ goto exit_dput;
+
+ /* Necessary?! I guess not ... */
+ follow_mount(&path);
+
+ err = -ENOENT;
+ if (!path.dentry->d_inode)
+ goto exit_dput;
+
+ err = -EISDIR;
+ if (!S_ISREG(path.dentry->d_inode->i_mode))
+ goto exit_dput;
+
+ if (path.dentry->d_parent != nd->path.dentry) {
+ err = __union_copyup(&path, nd, &path);
+ if (err)
+ goto exit_dput;
+ }
+
+ dput(nd->path.dentry);
+ if (nd->path.mnt != path.mnt)
+ mntput(nd->path.mnt);
+ nd->path = path;
+ return 0;
+
+exit_dput:
+ dput(path.dentry);
+ if (path.mnt != nd->path.mnt)
+ mntput(path.mnt);
+ return err;
+}
+
+/*
* This must be called when unhashing a dentry. This is called with dcache_lock
* and unhashes all unions this dentry is in.
*/
diff --git a/include/linux/union.h b/include/linux/union.h
index 0b6f356..405baa9 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -53,6 +53,10 @@ extern void __shrink_d_unions(struct dentry *, struct list_head *);
extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
struct dentry *);
extern void detach_mnt_union(struct vfsmount *);
+extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
+ struct path *);
+extern int __union_copyup(struct path *, struct nameidata *, struct path *);
+extern int union_copyup(struct nameidata *, int);

#else /* CONFIG_UNION_MOUNT */

@@ -67,6 +71,9 @@ extern void detach_mnt_union(struct vfsmount *);
#define __shrink_d_unions(x,y) do { } while (0)
#define attach_mnt_union(x, y, z) do { } while (0)
#define detach_mnt_union(x) do { } while (0)
+#define union_create_topmost(x, y, z) ({ BUG(); (NULL); })
+#define __union_copyup(x, y, z) ({ BUG(); (0); })
+#define union_copyup(x, y) ({ (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2009-10-21 19:25:28

by Valerie Aurora

Subject: [PATCH 28/41] union-mount: call do_whiteout() on unlink and rmdir

From: Jan Blunck <[email protected]>

Call do_whiteout() when removing files and directories.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f7ef769..1f2a214 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2898,6 +2898,10 @@ static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
if (err)
goto out;

+ err = -ENOENT;
+ if (!dentry->d_inode)
+ goto out;
+
err = -ENOTEMPTY;
if (isdir && !directory_is_empty(path->dentry, path->mnt))
goto out;
@@ -3012,6 +3016,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = hash_lookup_union(&nd, &nd.last, &path);
if (error)
goto exit2;
+ if (is_unionized(nd.path.dentry, nd.path.mnt)) {
+ error = do_whiteout(&nd, &path, 1);
+ goto exit3;
+ }
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit3;
@@ -3100,6 +3108,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
inode = path.dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ if (is_unionized(nd.path.dentry, nd.path.mnt)) {
+ error = do_whiteout(&nd, &path, 0);
+ goto exit2;
+ }
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
--
1.6.3.3

2009-10-21 19:24:42

by Valerie Aurora

Subject: [PATCH 29/41] union-mount: Always create topmost directory on open

When we open a directory, always create a matching directory on the
top level. This way we don't have to go back and create all the
directories on the path to an element when we want to copy it up.
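
As a reading aid, the decision made during the path walk can be modelled as
a small predicate. This is a standalone sketch, not code from the patch:
struct walk_state and want_topmost_create() are hypothetical names, and each
boolean stands in for the kernel expression named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the state examined for one path component. */
struct walk_state {
	bool last_lowlevel;	/* models nd->um_flags & LAST_LOWLEVEL */
	bool lookup_topmost;	/* models nd->flags & LOOKUP_TOPMOST */
	bool is_dir;		/* models a positive dentry with S_ISDIR() */
	bool on_lower_mnt;	/* models nd->path.mnt != next.mnt */
};

/*
 * Create the component on the top level only if the last element came
 * from a lower level, and either it is a directory that lives on a
 * lower mount or the caller explicitly asked for LOOKUP_TOPMOST.
 */
static bool want_topmost_create(const struct walk_state *w)
{
	if (!w->last_lowlevel)
		return false;
	return (w->is_dir && w->on_lower_mnt) || w->lookup_topmost;
}
```

A lower-level directory therefore gets created on the top level even
without LOOKUP_TOPMOST, while a lower-level regular file does not.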

XXX - Turn into #ifdef'able function

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 34 ++++++++++++++++++++++++++++++----
1 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1f2a214..8d95eb1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1284,8 +1284,31 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
if (err)
break;

- if ((nd->flags & LOOKUP_TOPMOST) &&
- (nd->um_flags & LAST_LOWLEVEL)) {
+ /*
+ * We want to create this element on the top level
+ * file system in two cases:
+ *
+ * - We are specifically told to - LOOKUP_TOPMOST.
+ * - This is a directory, and it does not yet exist on
+ * the top level. Various tricks only work if
+ * directories always exist on the top level.
+ *
+ * In either case, only create this element on the top
+ * level if the last element is located on the lower
+ * level. If the last element is located on the top
+ * level, then every single element in the path
+ * already exists on the top level.
+ *
+ * Note that we can assume that the parent is on the
+ * top level since we always create the directory on
+ * the top level.
+ */
+
+ if ((nd->um_flags & LAST_LOWLEVEL) &&
+ ((next.dentry->d_inode &&
+ S_ISDIR(next.dentry->d_inode->i_mode) &&
+ (nd->path.mnt != next.mnt)) ||
+ (nd->flags & LOOKUP_TOPMOST))) {
struct dentry *dentry;

dentry = union_create_topmost(nd, &this, &next);
@@ -1349,8 +1372,11 @@ last_component:
if (err)
break;

- if ((nd->flags & LOOKUP_TOPMOST) &&
- (nd->um_flags & LAST_LOWLEVEL)) {
+ if ((nd->um_flags & LAST_LOWLEVEL) &&
+ ((next.dentry->d_inode &&
+ S_ISDIR(next.dentry->d_inode->i_mode) &&
+ (nd->path.mnt != next.mnt)) ||
+ (nd->flags & LOOKUP_TOPMOST))) {
struct dentry *dentry;

dentry = union_create_topmost(nd, &this, &next);
--
1.6.3.3

2009-10-21 19:21:24

by Valerie Aurora

Subject: [PATCH 30/41] fallthru: Basic fallthru definitions

Define the fallthru dcache flag and file system op.

Signed-off-by: Valerie Aurora <[email protected]>
---
include/linux/dcache.h | 6 ++++++
include/linux/fs.h | 1 +
2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 730c432..a55f79f 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -193,6 +193,7 @@ d_iput: no no no yes

#define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
#define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */
+#define DCACHE_FALLTHRU 0x0100 /* Keep looking in the file system below */

#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */

@@ -381,6 +382,11 @@ static inline int d_is_whiteout(struct dentry *dentry)
return (dentry->d_flags & DCACHE_WHITEOUT);
}

+static inline int d_is_fallthru(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_FALLTHRU);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index efea78c..57690ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1532,6 +1532,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3

2009-10-21 19:21:39

by Valerie Aurora

Subject: [PATCH 31/41] fallthru: Support for fallthru entries in union mount lookup

A fallthru directory entry overrides the opaque flag for its parent
directory (for this directory entry only). Before, we stopped
building the union stack when we encountered an opaque directory; now
we include directories below opaque directories in the union stack and
check for opacity during lookup.
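
The lookup rule can be illustrated with a small userspace model. This is
only a sketch under simplifying assumptions: the union stack is a plain
array with the topmost directory at index 0, and struct toy_dir,
struct toy_entry and toy_union_lookup() are invented names, not kernel
interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_NEGATIVE, TOY_FILE, TOY_WHITEOUT, TOY_FALLTHRU };

struct toy_entry {
	const char *name;
	enum toy_kind kind;
};

/* One directory in the union stack. */
struct toy_dir {
	int opaque;			/* models IS_OPAQUE() on the directory */
	const struct toy_entry *entries;
	size_t nr_entries;
};

/* Look @name up in a single directory; TOY_NEGATIVE if it has no entry. */
static enum toy_kind toy_lookup_one(const struct toy_dir *dir,
				    const char *name)
{
	size_t i;

	for (i = 0; i < dir->nr_entries; i++)
		if (!strcmp(dir->entries[i].name, name))
			return dir->entries[i].kind;
	return TOY_NEGATIVE;
}

/*
 * Walk the stack from the top (index 0) downwards. Returns the index of
 * the layer that provides @name, or -1 if the name is not visible.
 */
static int toy_union_lookup(const struct toy_dir *stack, size_t depth,
			    const char *name)
{
	size_t i;

	for (i = 0; i < depth; i++) {
		enum toy_kind kind = toy_lookup_one(&stack[i], name);

		if (kind == TOY_FILE)
			return (int)i;	/* first positive entry wins */
		if (kind == TOY_WHITEOUT)
			return -1;	/* whiteout: stop unconditionally */
		if (stack[i].opaque && kind != TOY_FALLTHRU)
			return -1;	/* opaque parent hides lower layers */
		/* negative entry, or fallthru: keep looking below */
	}
	return -1;
}
```

In this model a fallthru entry in an opaque top directory lets exactly that
one name resolve from the layer below, while every other lower name stays
hidden.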

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 7 +++----
fs/namei.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d80a3bb..ca8a661 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1086,7 +1086,7 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
if (inode) {
- dentry->d_flags &= ~DCACHE_WHITEOUT;
+ dentry->d_flags &= ~(DCACHE_WHITEOUT|DCACHE_FALLTHRU);
list_add(&dentry->d_alias, &inode->i_dentry);
}
dentry->d_inode = inode;
@@ -1638,9 +1638,8 @@ void d_delete(struct dentry * dentry)

static void __d_rehash(struct dentry * entry, struct hlist_head *list)
{
-
- entry->d_flags &= ~DCACHE_UNHASHED;
- hlist_add_head_rcu(&entry->d_hash, list);
+ entry->d_flags &= ~DCACHE_UNHASHED;
+ hlist_add_head_rcu(&entry->d_hash, list);
}

static void _d_rehash(struct dentry * entry)
diff --git a/fs/namei.c b/fs/namei.c
index 8d95eb1..61e94aa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -416,6 +416,28 @@ static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
return dentry;
}

+/*
+ * Theory of operation for opaque, whiteout, and fallthru:
+ *
+ * whiteout: Unconditionally stop lookup here - ENOENT
+ *
+ * opaque: Don't lookup in directories lower in the union stack
+ *
+ * fallthru: While looking up an entry, ignore the opaque flag for the
+ * current directory only.
+ *
+ * A union stack is a linked list of directory dentries which appear
+ * in the same place in the namespace. When constructing the union
+ * stack, we include directories below opaque directories so that we
+ * can properly handle fallthrus. All non-fallthru lookups have to
+ * check for the opaque flag on the parent directory and obey it.
+ *
+ * In general, the code pattern is to look up the topmost entry
+ * first (either the first visible non-negative dentry or a negative
+ * dentry in the topmost layer of the union), then build the union
+ * stack for the newly looked-up entry (if it is a directory).
+ */
+
/**
* __cache_lookup_topmost - lookup the topmost (non-)negative dentry
*
@@ -445,6 +467,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (!dentry || (dentry->d_inode || d_is_whiteout(dentry)))
return !dentry;

+ /* Keep going through opaque directories if we found a fallthru */
+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(dentry))
+ return !dentry;
+
/* look for the first non-negative or whiteout dentry */

while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
@@ -470,6 +496,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (dentry->d_inode || d_is_whiteout(dentry))
goto out_dput;

+ /* Stop the lookup on opaque parent and non-fallthru child */
+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(dentry))
+ goto out_dput;
+
dput(dentry);
}

@@ -528,9 +558,6 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = dentry;
last.mnt = mntget(nd->path.mnt);
-
- if (IS_OPAQUE(last.dentry->d_inode))
- break;
}

if (last.dentry != path->dentry)
@@ -570,8 +597,7 @@ static int cache_lookup_union(struct nameidata *nd, struct qstr *name,

/* only directories can be part of a union stack */
if (!path->dentry->d_inode ||
- !S_ISDIR(path->dentry->d_inode->i_mode) ||
- IS_OPAQUE(path->dentry->d_inode))
+ !S_ISDIR(path->dentry->d_inode->i_mode))
goto out;

/* Build the union stack for this part */
@@ -738,6 +764,9 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (path->dentry->d_inode || d_is_whiteout(path->dentry))
return 0;

+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(path->dentry))
+ return 0;
+
while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
name->hash = full_name_hash(name->name, name->len);
if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
@@ -758,6 +787,9 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
goto out;
}

+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(next.dentry))
+ goto out;
+
dput(next.dentry);
}
out:
@@ -817,9 +849,6 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = next.dentry;
last.mnt = mntget(next.mnt);
-
- if (IS_OPAQUE(last.dentry->d_inode))
- break;
}

if (last.dentry != path->dentry)
@@ -841,8 +870,7 @@ static int real_lookup_union(struct nameidata *nd, struct qstr *name,

/* only directories can be part of a union stack */
if (!path->dentry->d_inode ||
- !S_ISDIR(path->dentry->d_inode->i_mode) ||
- IS_OPAQUE(path->dentry->d_inode))
+ !S_ISDIR(path->dentry->d_inode->i_mode))
goto out;

/* Build the union stack for this part */
@@ -1128,7 +1156,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
{
int err;

- if (IS_MNT_UNION(nd->path.mnt) && !IS_OPAQUE(nd->path.dentry->d_inode))
+ if (IS_MNT_UNION(nd->path.mnt))
goto need_union_lookup;

path->dentry = __d_lookup(nd->path.dentry, name);
@@ -1679,6 +1707,9 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
if (path->dentry->d_inode || d_is_whiteout(path->dentry))
return 0;

+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(path->dentry))
+ return 0;
+
while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
name->hash = full_name_hash(name->name, name->len);
if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
@@ -1701,6 +1732,9 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
goto out;
}

+ if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(next.dentry))
+ goto out;
+
dput(next.dentry);
}
out:
@@ -1755,9 +1789,6 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
path_put(&last);
last.dentry = next.dentry;
last.mnt = mntget(next.mnt);
-
- if (IS_OPAQUE(last.dentry->d_inode))
- break;
}

if (last.dentry != path->dentry)
--
1.6.3.3

2009-10-21 19:21:43

by Valerie Aurora

Subject: [PATCH 32/41] fallthru: ext2 fallthru support

Add support for fallthru directory entries to ext2.

XXX - Makes up inode number for fallthru entry
XXX - Might be better implemented as special symlinks

Cc: Theodore Tso <[email protected]>
Cc: [email protected]
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 1 +
fs/ext2/namei.c | 20 ++++++++++
include/linux/ext2_fs.h | 1 +
4 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d4628c0..2665bc6 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode && (de->file_type != EXT2_FT_WHT))
+ if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU)))
return 0;
return !memcmp(name, de->name, len);
}
@@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
[EXT2_FT_WHT] = DT_WHT,
+ [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
};

#define S_SHIFT 12
@@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
ext2_put_page(page);
return 0;
}
+ } else if (de->file_type == EXT2_FT_FALLTHRU) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+
+ offset = (char *)de - kaddr;
+ /* XXX We don't know the inode number
+ * of the directory entry in the
+ * underlying file system. Should
+ * look it up, either on fallthru
+ * creation at first readdir or now at
+ * filldir time. */
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ 123 /* Made up ino */, d_type);
+ if (over) {
+ ext2_put_page(page);
+ return 0;
+ }
}
filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
@@ -463,6 +483,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&dentry->d_lock);
+ } else if(!res && de->file_type == EXT2_FT_FALLTHRU) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
}
ext2_put_page(page);
}
@@ -532,6 +556,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
+ de->file_type = 0;
goto got_it;
}
if (de->rec_len == 0) {
@@ -545,6 +570,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU) &&
(rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
@@ -587,7 +613,8 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)

err = -EEXIST;
if (ext2_match (namelen, name, de)) {
- if (de->file_type == EXT2_FT_WHT)
+ if ((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU))
goto got_it;
goto out_unlock;
}
@@ -602,7 +629,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
!ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
@@ -627,6 +655,60 @@ out_unlock:
}

/*
+ * Create a fallthru entry.
+ */
+int ext2_fallthru_entry (struct inode *dir, struct dentry *dentry)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ if (de->inode || (de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_FALLTHRU;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
+/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
*/
@@ -711,7 +793,9 @@ int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
*/
if (ext2_match (namelen, name, de))
de->inode = 0;
- if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index a7f057f..328fc1c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -108,6 +108,7 @@ extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *,
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
extern int ext2_whiteout_entry (struct inode *, struct dentry *,
struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_fallthru_entry (struct inode *, struct dentry *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 9c4eef2..2ac44f1 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -333,6 +333,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
goto out;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
@@ -351,6 +352,24 @@ out:
return err;
}

+/*
+ * Create a fallthru entry.
+ */
+static int ext2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ int err;
+
+ err = ext2_fallthru_entry(dir, dentry);
+ if (err)
+ return err;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -451,6 +470,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
.whiteout = ext2_whiteout,
+ .fallthru = ext2_fallthru,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index bd10826..f6b68ec 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -577,6 +577,7 @@ enum {
EXT2_FT_SOCK,
EXT2_FT_SYMLINK,
EXT2_FT_WHT,
+ EXT2_FT_FALLTHRU,
EXT2_FT_MAX
};

--
1.6.3.3

2009-10-21 19:23:52

by Valerie Aurora

Subject: [PATCH 33/41] fallthru: jffs2 fallthru support

From: Felix Fietkau <[email protected]>

Add support for fallthru dentries to jffs2.

Cc: David Woodhouse <[email protected]>
Cc: [email protected]
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/jffs2/dir.c | 31 ++++++++++++++++++++++++++++++-
include/linux/jffs2.h | 6 ++++++
2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 46a2e1b..544d6c5 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -35,6 +35,7 @@ static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+static int jffs2_fallthru (struct inode *, struct dentry *);

const struct file_operations jffs2_dir_operations =
{
@@ -57,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rmdir = jffs2_rmdir,
.mknod = jffs2_mknod,
.rename = jffs2_rename,
+ .fallthru = jffs2_fallthru,
.whiteout = jffs2_whiteout,
.permission = jffs2_permission,
.setattr = jffs2_setattr,
@@ -107,6 +109,9 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
case DT_WHT:
target->d_flags |= DCACHE_WHITEOUT;
break;
+ case JFFS2_DT_FALLTHRU:
+ target->d_flags |= DCACHE_FALLTHRU;
+ break;
default:
ino = fd->ino;
break;
@@ -168,7 +173,10 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
fd->name, fd->ino, fd->type, curofs, offset));
continue;
}
- if (!fd->ino) {
+ if (fd->type == JFFS2_DT_FALLTHRU)
+ /* XXX Should really do a lookup for the real inode number here */
+ fd->ino = 100;
+ else if (!fd->ino && (fd->type != DT_WHT)) {
D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
offset++;
continue;
@@ -797,6 +805,26 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return 0;
}

+static int jffs2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ uint32_t now;
+ int ret;
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_UNKNOWN,
+ dentry->d_name.name, dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+
+ return 0;
+}
+
static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
{
@@ -830,6 +858,7 @@ static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
return ret;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 65533bb..dbe8c93 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -114,6 +114,12 @@ struct jffs2_unknown_node
jint32_t hdr_crc;
};

+/*
+ * Non-standard directory entry type(s), for on-disk use
+ */
+
+#define JFFS2_DT_FALLTHRU (DT_WHT + 1)
+
struct jffs2_raw_dirent
{
jint16_t magic;
--
1.6.3.3
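
Note (not part of the patch): the lookup hunk above maps a directory
entry type to a dentry state. A minimal user-space sketch of that
mapping follows, assuming the d_type values from <dirent.h>. The sk_*
names and the flag values are hypothetical stand-ins for
DCACHE_WHITEOUT and DCACHE_FALLTHRU, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* d_type values as in <dirent.h>; the fallthru value follows the patch. */
enum {
	SK_DT_UNKNOWN  = 0,
	SK_DT_REG      = 8,
	SK_DT_WHT      = 14,
	SK_DT_FALLTHRU = SK_DT_WHT + 1	/* like JFFS2_DT_FALLTHRU */
};

/* Hypothetical flag bits standing in for the DCACHE_* flags. */
enum { SK_WHITEOUT = 1, SK_FALLTHRU = 2 };

struct sk_lookup {
	uint32_t ino;		/* 0 means the dentry stays negative */
	unsigned int flags;
};

/*
 * Mirrors the switch in jffs2_lookup(): the two special types only set
 * a flag and leave the dentry negative; every other type resolves to
 * the inode number stored in the directory entry.
 */
static struct sk_lookup sk_classify(unsigned int type, uint32_t fd_ino)
{
	struct sk_lookup r = { 0, 0 };

	switch (type) {
	case SK_DT_WHT:
		r.flags |= SK_WHITEOUT;
		break;
	case SK_DT_FALLTHRU:
		r.flags |= SK_FALLTHRU;
		break;
	default:
		r.ino = fd_ino;
		break;
	}
	return r;
}
```

For example, sk_classify(SK_DT_FALLTHRU, 7) yields ino 0 with only
SK_FALLTHRU set, while sk_classify(SK_DT_REG, 7) yields ino 7 and no
flags.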

2009-10-21 19:23:56

by Valerie Aurora

Subject: [PATCH 34/41] fallthru: tmpfs fallthru support

Add support for fallthru directory entries to tmpfs.

XXX - Makes up inode number for dirent

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 3 +-
fs/libfs.c | 21 +++++++++++++++++--
mm/shmem.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca8a661..8ef2d89 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2292,7 +2292,8 @@ resume:
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
if (d_unhashed(dentry)||(!dentry->d_inode &&
- !d_is_whiteout(dentry)))
+ !d_is_whiteout(dentry) &&
+ !d_is_fallthru(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/fs/libfs.c b/fs/libfs.c
index dcec3d3..01f3e73 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -133,6 +133,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *cursor = filp->private_data;
struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
+ int d_type;
int i = filp->f_pos;

switch (i) {
@@ -158,14 +159,28 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (d_unhashed(next) || !next->d_inode)
+ if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
continue;

+ if (d_is_fallthru(next)) {
+ /* XXX We don't know the inode
+ * number of the directory
+ * entry in the underlying
+ * file system. Should look
+ * it up, either on fallthru
+ * creation at first readdir
+ * or now at filldir time. */
+ ino = 123; /* Made up ino */
+ d_type = DT_UNKNOWN;
+ } else {
+ ino = next->d_inode->i_ino;
+ d_type = dt_type(next->d_inode);
+ }
+
spin_unlock(&dcache_lock);
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
- next->d_inode->i_ino,
- dt_type(next->d_inode)) < 0)
+ ino, d_type) < 0)
return 0;
spin_lock(&dcache_lock);
/* next is still alive */
diff --git a/mm/shmem.c b/mm/shmem.c
index 2faa14b..4f4b4b6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1798,8 +1798,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
static int shmem_unlink(struct inode *dir, struct dentry *dentry);

/*
- * This is the whiteout support for tmpfs. It uses one singleton whiteout
- * inode per superblock thus it is very similar to shmem_link().
+ * Create a dentry to signify a whiteout.
*/
static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
@@ -1830,8 +1829,10 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
spin_unlock(&sbinfo->stat_lock);
}

- if (old_dentry->d_inode) {
- if (S_ISDIR(old_dentry->d_inode->i_mode))
+ if (old_dentry->d_inode || d_is_fallthru(old_dentry)) {
+ /* A fallthru for a dir is treated like a regular link */
+ if (old_dentry->d_inode &&
+ S_ISDIR(old_dentry->d_inode->i_mode))
shmem_rmdir(dir, old_dentry);
else
shmem_unlink(dir, old_dentry);
@@ -1848,6 +1849,48 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
}

static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode);
+
+/*
+ * Create a dentry to signify a fallthru. A fallthru in tmpfs is the
+ * logical equivalent of an in-kernel readdir() cache. It can't be
+ * deleted until the file system is unmounted.
+ */
+static int shmem_fallthru(struct inode *dir, struct dentry *dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+
+ /* FIXME: this is stupid */
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ if (dentry->d_inode || d_is_fallthru(dentry) || d_is_whiteout(dentry))
+ return -EEXIST;
+
+ /*
+ * Each new link needs a new dentry, pinning lowmem, and tmpfs
+ * dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ shmem_d_instantiate(dir, dentry, NULL);
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
if (d_is_whiteout(dentry)) {
@@ -1855,14 +1898,15 @@ static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
shmem_free_inode(dir->i_sb);
if (S_ISDIR(inode->i_mode))
inode->i_mode |= S_OPAQUE;
+ } else if (d_is_fallthru(dentry)) {
+ shmem_free_inode(dir->i_sb);
} else {
/* New dentry */
dir->i_size += BOGO_DIRENT_SIZE;
dget(dentry); /* Extra count - pin the dentry in core */
}
- /* Will clear DCACHE_WHITEOUT flag */
+ /* Will clear DCACHE_WHITEOUT and DCACHE_FALLTHRU flags */
d_instantiate(dentry, inode);
-
}
/*
* File creation. Allocate an inode, and we're done..
@@ -1947,7 +1991,8 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ if (d_is_whiteout(dentry) || d_is_fallthru(dentry) ||
+ (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
shmem_free_inode(dir->i_sb);

if (inode) {
@@ -2583,6 +2628,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.mknod = shmem_mknod,
.rename = shmem_rename,
.whiteout = shmem_whiteout,
+ .fallthru = shmem_fallthru,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3
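
Note (not part of the patch): the dcache_readdir() change above can be
modelled in user space. The sketch below assumes a flat array of child
entries instead of the dcache list. The sk_* types are hypothetical;
the placeholder inode number 123 and DT_UNKNOWN (0) are taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_DT_UNKNOWN	0
#define SK_MADE_UP_INO	123	/* same placeholder the patch uses */

struct sk_dentry {
	const char *name;
	bool unhashed;
	bool fallthru;
	unsigned long ino;	/* 0 means negative dentry */
	unsigned char d_type;
};

struct sk_dirent {
	const char *name;
	unsigned long ino;
	unsigned char d_type;
};

/*
 * Emit the visible children into out[] and return how many were
 * written. Unhashed entries are skipped, and so are negative entries
 * unless they are fallthrus. A fallthru has no inode of its own, so it
 * is reported with the made-up inode number and an unknown type.
 */
static size_t sk_readdir(const struct sk_dentry *children, size_t n,
			 struct sk_dirent *out, size_t cap)
{
	size_t emitted = 0;

	for (size_t i = 0; i < n && emitted < cap; i++) {
		const struct sk_dentry *d = &children[i];

		if (d->unhashed || (d->ino == 0 && !d->fallthru))
			continue;

		out[emitted].name = d->name;
		if (d->fallthru) {
			out[emitted].ino = SK_MADE_UP_INO;
			out[emitted].d_type = SK_DT_UNKNOWN;
		} else {
			out[emitted].ino = d->ino;
			out[emitted].d_type = d->d_type;
		}
		emitted++;
	}
	return emitted;
}
```

The sketch keeps the placeholder on purpose: as the XXX comment in the
patch says, the real inode number would have to be looked up in the
underlying file system.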

2009-10-21 19:21:47

by Valerie Aurora

Subject: [PATCH 35/41] union-mount: Copy up directory entries on first readdir()

readdir() in union mounts is implemented by copying up all visible
directory entries from the lower level directories to the topmost
directory. Directory entries that refer to lower level file system
objects are marked as "fallthru" in the topmost directory.

Thanks to Felix Fietkau <[email protected]> for a bug fix.

XXX - Do we need i_mutex on lower layer?
XXX - Rewrite for two layers only?

Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Felix Fietkau <[email protected]>
---
fs/readdir.c | 17 +++++
fs/union.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/union.h | 2 +
3 files changed, 190 insertions(+), 0 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 3a48491..cfeacd8 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,6 +16,8 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/unistd.h>
+#include <linux/union.h>
+#include <linux/mount.h>

#include <asm/uaccess.h>

@@ -36,9 +38,24 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
+ /*
+ * XXX Think harder about locking for
+ * union_copyup_dir. Currently we lock the topmost
+ * directory and hold that lock while sequentially
+ * acquiring and dropping locks for the directories
+ * below this one in the union stack.
+ */
+ if (is_unionized(file->f_path.dentry, file->f_path.mnt) &&
+ !IS_OPAQUE(inode) && IS_MNT_UNION(file->f_path.mnt)) {
+ res = union_copyup_dir(&file->f_path);
+ if (res)
+ goto out_unlock;
+ }
+
res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
+out_unlock:
mutex_unlock(&inode->i_mutex);
out:
return res;
diff --git a/fs/union.c b/fs/union.c
index de31fc9..d56b829 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -5,6 +5,7 @@
* Copyright (C) 2007-2009 Novell Inc.
*
* Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the Free
@@ -777,3 +778,173 @@ void detach_mnt_union(struct vfsmount *mnt)
union_put(um);
return;
}
+
+/**
+ * union_copyup_dir_one - copy up a single directory entry
+ *
+ * Individual directory entry copyup function for union_copyup_dir.
+ * We get the entries from higher level layers first.
+ */
+
+static int union_copyup_dir_one(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct dentry *topmost_dentry = (struct dentry *) buf;
+ struct dentry *dentry;
+ int err = 0;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ /* Lookup this entry in the topmost directory */
+ dentry = lookup_one_len(name, topmost_dentry, namlen);
+
+ if (IS_ERR(dentry)) {
+ printk(KERN_INFO "error looking up %.*s\n", namlen, name);
+ goto out;
+ }
+
+ /*
+ * If the entry already exists, one of the following is true:
+ * it was already copied up (due to an earlier lookup), an
+ * entry with the same name already exists on the topmost file
+ * system, it is a whiteout, or it is a fallthru. In each
+ * case, the top level entry masks any entries from lower file
+ * systems, so don't copy up this entry.
+ */
+ if (dentry->d_inode || d_is_whiteout(dentry) ||
+ d_is_fallthru(dentry)) {
+ printk(KERN_INFO "skipping copy of %s\n", dentry->d_name.name);
+ goto out_dput;
+ }
+
+ /*
+ * If the entry doesn't exist, create a fallthru entry in the
+ * topmost file system. All possible directory types are
+ * used, so each file system must implement its own way of
+ * storing a fallthru entry.
+ */
+ printk(KERN_INFO "creating fallthru for %s\n", dentry->d_name.name);
+ err = topmost_dentry->d_inode->i_op->fallthru(topmost_dentry->d_inode,
+ dentry);
+ /* FIXME */
+ BUG_ON(err);
+ /*
+ * At this point, we have a negative dentry marked as fallthru
+ * in the cache. We could potentially look up the entry in the
+ * lower level file system and turn this into a positive dentry
+ * right now, but it is not clear that would be a performance
+ * win, and it adds more opportunities to fail.
+ */
+out_dput:
+ dput(dentry);
+out:
+ return 0;
+}
+
+/**
+ * union_copyup_dir - copy up low-level directory entries to topmost dir
+ *
+ * readdir() is difficult to support on union file systems for two
+ * reasons: We must eliminate duplicates and apply whiteouts, and we
+ * must return something in f_pos that lets us restart in the same
+ * place when we return. Our solution is to, on first readdir() of
+ * the directory, copy up all visible entries from the low-level file
+ * systems and mark the entries that refer to low-level file system
+ * objects as "fallthru" entries.
+ */
+
+int union_copyup_dir(struct path *topmost_path)
+{
+ struct dentry *topmost_dentry = topmost_path->dentry;
+ struct path path = *topmost_path;
+ int res = 0;
+
+ /*
+ * Skip opaque dirs.
+ */
+ if (IS_OPAQUE(topmost_dentry->d_inode))
+ return 0;
+
+ res = mnt_want_write(topmost_path->mnt);
+ if (res)
+ return res;
+
+ /*
+ * Mark this dir opaque to show that we have already copied up
+ * the lower entries. Only fallthru entries pass through to
+ * the underlying file system.
+ *
+ * XXX Deal with the lower file system changing. This could
+ * be through running a tool over the top level file system to
+ * make directories transparent again, or we could check the
+ * mtime of the underlying directory.
+ */
+
+ topmost_dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(topmost_dentry->d_inode);
+
+ /*
+ * Loop through each dir on each level copying up the entries
+ * to the topmost.
+ */
+
+ /* Don't drop the caller's reference to the topmost path */
+ path_get(&path);
+ while (follow_union_down(&path.mnt, &path.dentry)) {
+ struct file * ftmp;
+ struct inode * inode;
+
+ /* XXX Permit fallthrus on lower-level? Would need to
+ * pass in opaque flag to union_copyup_dir_one() and
+ * only copy up fallthru entries there. We allow
+ * fallthrus in lower level opaque directories on
+ * lookup, so for consistency we should do one or the
+ * other in both places. */
+ if (IS_OPAQUE(path.dentry->d_inode))
+ break;
+
+ /* dentry_open() doesn't get a path reference itself */
+ path_get(&path);
+ ftmp = dentry_open(path.dentry, path.mnt,
+ O_RDONLY | O_DIRECTORY | O_NOATIME,
+ current_cred());
+ if (IS_ERR(ftmp)) {
+ printk (KERN_ERR "unable to open dir %s for "
+ "directory copyup: %ld\n",
+ path.dentry->d_name.name, PTR_ERR(ftmp));
+ continue;
+ }
+
+ inode = path.dentry->d_inode;
+ mutex_lock(&inode->i_mutex);
+
+ res = -ENOENT;
+ if (IS_DEADDIR(inode))
+ goto out_fput;
+ /*
+ * Read the whole directory, calling our directory
+ * entry copyup function on each entry. Pass in the
+ * topmost dentry as our private data so we can create
+ * new entries in the topmost directory.
+ */
+ res = ftmp->f_op->readdir(ftmp, topmost_dentry,
+ union_copyup_dir_one);
+out_fput:
+ mutex_unlock(&inode->i_mutex);
+ fput(ftmp);
+
+ if (res)
+ break;
+ }
+ path_put(&path);
+ mnt_drop_write(topmost_path->mnt);
+ return res;
+}
diff --git a/include/linux/union.h b/include/linux/union.h
index 405baa9..a0656b3 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -57,6 +57,7 @@ extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
struct path *);
extern int __union_copyup(struct path *, struct nameidata *, struct path *);
extern int union_copyup(struct nameidata *, int);
+extern int union_copyup_dir(struct path *path);

#else /* CONFIG_UNION_MOUNT */

@@ -74,6 +75,7 @@ extern int union_copyup(struct nameidata *, int);
#define union_create_topmost(x, y, z) ({ BUG(); (NULL); })
#define __union_copyup(x, y, z) ({ BUG(); (0); })
#define union_copyup(x, y) ({ (0); })
+#define union_copyup_dir(x) ({ BUG(); (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3
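
Note (not part of the patch): a user-space sketch of the copy-up rule
described in the commit message, assuming a single lower layer and
fixed-size in-memory directories. All sk_* names are hypothetical. The
real code walks every lower layer with follow_union_down() and calls
the file system's ->fallthru() operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX_ENTRIES 16

enum sk_kind { SK_REAL, SK_WHITEOUT, SK_FALLTHRU };

struct sk_entry {
	const char *name;
	enum sk_kind kind;
};

struct sk_dir {
	struct sk_entry entries[SK_MAX_ENTRIES];
	size_t count;
	bool opaque;
};

static const struct sk_entry *sk_find(const struct sk_dir *dir,
				      const char *name)
{
	for (size_t i = 0; i < dir->count; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/*
 * Copy up the names in 'lower' that are not already masked in 'top'.
 * Any existing top entry (real file, whiteout or fallthru) masks the
 * lower one. The directory is marked opaque first, so a second call is
 * a no-op. Returns 0 on success, -1 if 'top' has no room left.
 */
static int sk_copyup_dir(struct sk_dir *top, const struct sk_dir *lower)
{
	if (top->opaque)
		return 0;
	top->opaque = true;

	for (size_t i = 0; i < lower->count; i++) {
		const char *name = lower->entries[i].name;

		if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
			continue;
		if (sk_find(top, name))
			continue;
		if (top->count == SK_MAX_ENTRIES)
			return -1;
		top->entries[top->count].name = name;
		top->entries[top->count].kind = SK_FALLTHRU;
		top->count++;
	}
	return 0;
}
```

With a top directory holding "a" (real) and "b" (whiteout) over a lower
directory holding "a", "b" and "c", only "c" is added, as a fallthru.
Once the copy-up has run, an ordinary readdir() of the top directory is
enough, which is what makes f_pos restartable.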

2009-10-21 19:24:13

by Valerie Aurora

Subject: [PATCH 36/41] union-mount: Increment read-only users count for read-only layer

Union mounts want to guarantee that the read-only layer is read-only -
and stays read-only. Use the new superblock read-only user count.

XXX - Put common code in loopback and regular mounts in a function

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 0280e5b..505974a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1111,6 +1111,11 @@ static int do_umount(struct vfsmount *mnt, int flags)
spin_unlock(&vfsmount_lock);
if (retval)
security_sb_umount_busy(mnt);
+ /* If this was a union mount, we are no longer a read-only
+ * user on the underlying mount */
+ if (mnt->mnt_flags & MNT_UNION)
+ mnt->mnt_parent->mnt_sb->s_readonly_users--;
+
up_write(&namespace_sem);
release_mounts(&umount_list);
return retval;
@@ -1511,6 +1516,10 @@ static int do_loopback(struct path *path, char *old_name, int recurse,
release_mounts(&umount_list);
}

+ /* If this is a union mount, add ourselves to the readonly users */
+ if (mnt_flags & MNT_UNION)
+ mnt->mnt_parent->mnt_sb->s_readonly_users++;
+
out:
up_write(&namespace_sem);
path_put(&old_path);
@@ -1730,6 +1739,10 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
if ((err = graft_tree(newmnt, path)))
goto unlock;

+ /* If this is a union mount, add ourselves to the readonly users */
+ if (mnt_flags & MNT_UNION)
+ newmnt->mnt_parent->mnt_sb->s_readonly_users++;
+
if (fslist) /* add to the specified expiration list */
list_add_tail(&newmnt->mnt_expire, fslist);

--
1.6.3.3
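
Note (not part of the patch): a small user-space model of the
accounting above. The sk_* names and the SK_MNT_UNION value are
hypothetical. The sk_may_remount_rw() helper is an assumption about how
the counter is meant to be consumed; the patch that introduces
s_readonly_users is not shown in this part of the thread.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MNT_UNION 0x1	/* hypothetical flag value */

struct sk_sb {
	int readonly_users;	/* models s_readonly_users */
};

struct sk_mount {
	int flags;
	struct sk_sb *lower_sb;	/* superblock of the layer below */
};

/* Mount: a union mount registers itself as a read-only user below. */
static void sk_mount_attach(struct sk_mount *m, struct sk_sb *lower,
			    int flags)
{
	m->flags = flags;
	m->lower_sb = lower;
	if (flags & SK_MNT_UNION)
		lower->readonly_users++;
}

/* Umount: drop the read-only user taken at mount time. */
static void sk_mount_detach(struct sk_mount *m)
{
	if (m->flags & SK_MNT_UNION)
		m->lower_sb->readonly_users--;
	m->lower_sb = NULL;
	m->flags = 0;
}

/* Assumed consumer: refuse a read-write remount while users remain. */
static int sk_may_remount_rw(const struct sk_sb *sb)
{
	return sb->readonly_users == 0;
}
```

The increment and decrement must stay paired, which is why the patch
touches both mount paths (do_loopback() and do_add_mount()) as well as
do_umount().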

2009-10-21 19:23:35

by Valerie Aurora

Subject: [PATCH 37/41] union-mount: Check read-only/read-write status of layers

The top layer of a union mount must be writable (in order to support
readdir-triggered copyups) and the bottom layer must be read-only (to
avoid nasty races).

Thanks to Felix Fietkau <[email protected]> for a bug fix.

XXX - Add requirement that top layer is mounted only once

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 73 +++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 505974a..9b71743 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1462,6 +1462,61 @@ static int do_change_type(struct path *path, int flag)
}

/*
+ * Mount-time check of upper and lower layer file systems to see if we
+ * can union mount one on the other.
+ *
+ * Union mounts must follow these rules:
+ *
+ * - The lower layer must be read-only. This avoids lots of nasty
+ * unsolvable races where file system structures disappear suddenly.
+ * XXX - Checking the vfsmnt for read-only is a temporary hack; the
+ * file system could be mounted read-write elsewhere. We need to
+ * enforce read-only at the superblock level (patches coming).
+ *
+ * - The upper layer must be writable. This isn't an absolute
+ * requirement; right now we need it to make readdir() work since we
+ * copy up directory entries to the top level. A possible
+ * workaround is to mount a tmpfs file system transparently over the
+ * top.
+ *
+ * - The upper layer must support whiteouts and fallthrus (if it is
+ * writeable).
+ *
+ * - The lower layer must not also be a union mount. This is just to
+ * make life simpler for now, there is no inherent limitation on the
+ * number of layers.
+ *
+ * XXX - Check other mount flags for incompatibilities - I'm sure
+ * there are some.
+ */
+
+static int
+check_union_mnt(struct path *mntpnt, struct vfsmount *top_mnt, int mnt_flags)
+{
+ struct vfsmount *lower_mnt = mntpnt->mnt;
+
+ /* Is this even a union mount? */
+ if (!(mnt_flags & MNT_UNION))
+ return 0;
+
+ /* Lower layer must be read-only and not a union mount */
+ if (!(lower_mnt->mnt_sb->s_flags & MS_RDONLY) ||
+ (lower_mnt->mnt_flags & MNT_UNION))
+ return -EBUSY;
+
+ /* Upper layer must be writable */
+ if (mnt_flags & MNT_READONLY)
+ return -EROFS;
+
+ /* Upper layer must support whiteouts and fallthrus */
+ if (!(top_mnt->mnt_sb->s_flags & MS_WHITEOUT))
+ return -EINVAL;
+
+ /* All good! */
+ return 0;
+}
+
+/*
* do loopback mount.
*/
static int do_loopback(struct path *path, char *old_name, int recurse,
@@ -1495,13 +1550,8 @@ static int do_loopback(struct path *path, char *old_name, int recurse,
if (!mnt)
goto out;

- /*
- * Unions couldn't be writable if the filesystem doesn't know about
- * whiteouts
- */
- err = -ENOTSUPP;
- if ((mnt_flags & MNT_UNION) &&
- !(mnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+ err = check_union_mnt(path, mnt, mnt_flags);
+ if (err)
goto out;

if (mnt_flags & MNT_UNION)
@@ -1726,13 +1776,8 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

- /*
- * Unions couldn't be writable if the filesystem doesn't know about
- * whiteouts
- */
- err = -ENOTSUPP;
- if ((mnt_flags & MNT_UNION) &&
- !(newmnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+ err = check_union_mnt(path, newmnt, mnt_flags);
+ if (err)
goto unlock;

newmnt->mnt_flags = mnt_flags;
--
1.6.3.3
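
Note (not part of the patch): the rules checked by check_union_mnt()
can be restated as a user-space function. The sketch below assumes each
layer is summarised by three booleans. The sk_* names are hypothetical;
the order of the checks and the error codes follow the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct sk_layer {
	bool read_only;		/* MS_RDONLY / MNT_READONLY */
	bool is_union;		/* already a union mount */
	bool supports_whiteout;	/* MS_WHITEOUT on the superblock */
};

/*
 * Returns 0 if 'top' may be union mounted over 'lower', or a negative
 * errno value naming the first rule that is violated.
 */
static int sk_check_union(const struct sk_layer *lower,
			  const struct sk_layer *top, bool want_union)
{
	/* Not a union mount: nothing to check. */
	if (!want_union)
		return 0;

	/* Lower layer must be read-only and not itself a union. */
	if (!lower->read_only || lower->is_union)
		return -EBUSY;

	/* Upper layer must be writable. */
	if (top->read_only)
		return -EROFS;

	/* Upper layer must support whiteouts and fallthrus. */
	if (!top->supports_whiteout)
		return -EINVAL;

	return 0;
}
```

A writable lower layer is rejected with -EBUSY before the upper layer
is looked at, so a mount that violates several rules reports only the
first one.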

2009-10-21 19:21:50

by Valerie Aurora

Subject: [PATCH 38/41] union-mount: Make pivot_root work with union mounts

When moving a union mount, follow it down to the bottom layer and move
that instead of just the top layer.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9b71743..6ac5fc1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2282,6 +2282,15 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
if (d_unlinked(old.dentry))
goto out2;
error = -EBUSY;
+ /*
+ * follow_union_down() only goes one layer down. We want the
+ * bottom-most layer here - if we move that around, all the
+ * layers on top move with it. But if we ever allow more than
+ * two layers, the below two will both need to be in while()
+ * loops.
+ */
+ follow_union_down(&new.mnt, &new.dentry);
+ follow_union_down(&root.mnt, &root.dentry);
if (new.mnt == root.mnt ||
old.mnt == root.mnt)
goto out2; /* loop, on the same file system */
--
1.6.3.3
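
Note (not part of the patch): the comment in the hunk above says that
supporting more than two layers would require while() loops. A
user-space sketch of that loop follows, assuming layers form a singly
linked list. The sk_* names are hypothetical; sk_follow_down() only
imitates the "one step down, report whether we moved" behaviour the
comment attributes to follow_union_down().

```c
#include <assert.h>
#include <stddef.h>

struct sk_layer {
	const char *name;
	struct sk_layer *below;	/* NULL at the bottom of the stack */
};

/* Step down exactly one layer; return 1 if we moved, 0 otherwise. */
static int sk_follow_down(struct sk_layer **layer)
{
	if ((*layer)->below == NULL)
		return 0;
	*layer = (*layer)->below;
	return 1;
}

/* Keep stepping until the bottom-most layer is reached. */
static struct sk_layer *sk_bottom(struct sk_layer *layer)
{
	while (sk_follow_down(&layer))
		;
	return layer;
}
```

With exactly two layers, a single sk_follow_down() call and sk_bottom()
give the same result, which matches the single calls in the patch.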

2009-10-21 19:22:57

by Valerie Aurora

Subject: [PATCH 39/41] union-mount: Ignore read-only file system in permission checks

In certain cases, we check a file for write access before it has been
copied up to the top-level fs. We don't want to fail because the
bottom layer is read-only - of course it is - so skip that check in
those cases.

Thanks to Felix Fietkau <[email protected]> for a bug fix.

XXX - Document when to call union_permission() vs. inode_permission()
XXX - Kinda gross. Probably a simpler solution.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 21 +++++++++++++++++----
fs/open.c | 8 ++++++--
fs/union.c | 32 ++++++++++++++++++++++++++++++--
include/linux/fs.h | 1 +
include/linux/union.h | 2 ++
5 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 61e94aa..a8d3acf 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -230,16 +230,17 @@ int generic_permission(struct inode *inode, int mask,
}

/**
- * inode_permission - check for access rights to a given inode
+ * __inode_permission - check for access rights to a given inode
* @inode: inode to check permission on
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ * @rofs: check for read-only fs
*
* Used to check for read/write/execute permissions on an inode.
* We use "fsuid" for this, letting us set arbitrary permissions
* for filesystem access without changing the "normal" uids which
* are used for other things.
*/
-int inode_permission(struct inode *inode, int mask)
+int __inode_permission(struct inode *inode, int mask, int rofs)
{
int retval;

@@ -249,7 +250,7 @@ int inode_permission(struct inode *inode, int mask)
/*
* Nobody gets write access to a read-only fs.
*/
- if (IS_RDONLY(inode) &&
+ if (rofs && IS_RDONLY(inode) &&
(S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
return -EROFS;

@@ -277,6 +278,18 @@ int inode_permission(struct inode *inode, int mask)
}

/**
+ * inode_permission - check for access rights to a given inode
+ * @inode: inode to check permission on
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * This version pays attention to the MS_RDONLY flag on the fs.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+ return __inode_permission(inode, mask, 1);
+}
+
+/**
* file_permission - check for additional access rights to a given file
* @file: file to check access rights for
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
@@ -2129,7 +2142,7 @@ int may_open(struct path *path, int acc_mode, int flag)
break;
}

- error = inode_permission(inode, acc_mode);
+ error = union_permission(path, acc_mode);
if (error)
return error;

diff --git a/fs/open.c b/fs/open.c
index dd98e80..3df5a1b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -30,6 +30,7 @@
#include <linux/audit.h>
#include <linux/falloc.h>
#include <linux/fs_struct.h>
+#include <linux/union.h>

int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
@@ -333,6 +334,7 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
error = security_path_truncate(&file->f_path, length,
ATTR_MTIME|ATTR_CTIME);
if (!error)
+ /* Already copied up for union, opened with write */
error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
out_putf:
fput(file);
@@ -493,7 +495,8 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
goto out_path_release;
}

- res = inode_permission(inode, mode | MAY_ACCESS);
+ res = union_permission(&path, mode | MAY_ACCESS);
+
/* SuS v2 requires we report a read only fs too */
if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
goto out_path_release;
@@ -507,7 +510,8 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if ((!is_unionized(path.dentry, path.mnt) &&
+ (__mnt_is_readonly(path.mnt))))
res = -EROFS;

out_path_release:
diff --git a/fs/union.c b/fs/union.c
index d56b829..8d94b22 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -390,6 +390,30 @@ static int union_relookup_topmost(struct nameidata *nd, int flags)
return err;
}

+
+/**
+ * union_permission - check for access rights to a given inode
+ * @path: path to check permission on
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * In a union mount, the top layer is always read-write and the bottom
+ * is always read-only. Ignore the read-only flag on the lower fs.
+ *
+ * Only needed for certain activities, like checking to see if write
+ * access is ok.
+ */
+
+int union_permission(struct path *path, int mask)
+{
+ struct inode *inode = path->dentry->d_inode;
+
+ if (!is_unionized(path->dentry, path->mnt))
+ return inode_permission(inode, mask);
+
+ /* Tell __inode_permission to ignore MS_RDONLY */
+ return __inode_permission(inode, mask, 0);
+}
+
/*
* union_create_topmost - create the topmost path component
* @nd: pointer to nameidata of the base directory
@@ -489,6 +513,9 @@ static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
if (IS_ERR(new_file))
goto fput_old;

+ /* XXX be smart by using a length param, which indicates max
+ * data we'll want (e.g., we are about to truncate to 0 or 10
+ * bytes or something) */
size = i_size_read(old_file->f_path.dentry->d_inode);
if (((size_t)size != size) || ((ssize_t)size != size)) {
ret = -EFBIG;
@@ -516,7 +543,8 @@ static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
* The topmost directory @new_nd must already be locked. Creates the topmost
* file if it doesn't exist yet.
*/
-int __union_copyup(struct path *old, struct nameidata *new_nd, struct path *new)
+int __union_copyup(struct path *old, struct nameidata *new_nd,
+ struct path *new)
{
struct dentry *dentry;
int error;
@@ -581,7 +609,7 @@ out_dput:
* @nd: nameidata pointer to the file
* @flags: flags given to open_namei
*/
-int union_copyup(struct nameidata *nd, int flags)
+int union_copyup(struct nameidata *nd, int flags /* XXX not used */)
{
struct qstr this;
char *name;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 57690ab..38fb113 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2106,6 +2106,7 @@ extern void emergency_remount(void);
extern sector_t bmap(struct inode *, sector_t);
#endif
extern int notify_change(struct dentry *, struct iattr *);
+extern int __inode_permission(struct inode *inode, int mask, int rofs);
extern int inode_permission(struct inode *, int);
extern int generic_permission(struct inode *, int,
int (*check_acl)(struct inode *, int));
diff --git a/include/linux/union.h b/include/linux/union.h
index a0656b3..92654e0 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -58,6 +58,7 @@ extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
extern int __union_copyup(struct path *, struct nameidata *, struct path *);
extern int union_copyup(struct nameidata *, int);
extern int union_copyup_dir(struct path *path);
+extern int union_permission(struct path *, int);

#else /* CONFIG_UNION_MOUNT */

@@ -76,6 +77,7 @@ extern int union_copyup_dir(struct path *path);
#define __union_copyup(x, y, z) ({ BUG(); (0); })
#define union_copyup(x, y) ({ (0); })
#define union_copyup_dir(x) ({ BUG(); (0); })
+#define union_permission(x, y) inode_permission((x)->dentry->d_inode, y)

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2009-10-21 19:22:11

by Valerie Aurora

Subject: [PATCH 40/41] union-mount: Make truncate work in all its glorious UNIX variations

Implement truncate(), ftruncate(), and open(O_TRUNC) for union mounts.

This moves the union_copyup() in do_filp_open() down below may_open()
- this way you don't copy up a file you don't even have permission to
open.

may_open() now takes a nameidata * because it may have to do a
union_copyup() internally if O_TRUNC is specified. It's a trivial
change, all callers were just doing "may_open(&nd.path, ...)" anyway.
It kinda sucks, but may_open() auto-magically doing a truncate also
sucks (may open? may truncate, too!).

XXX - Only copy up the bytes that won't be truncated.
XXX - Re-organize code. may_open() especially blah.
XXX - truncate() implemented as in-kernel file open and ftruncate()
XXX - Split up into smaller pieces

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 22 +++++----
fs/nfsctl.c | 6 +-
fs/open.c | 124 ++++++++++++++++++++--------------------------------
include/linux/fs.h | 2 +-
4 files changed, 64 insertions(+), 90 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a8d3acf..e3e8e98 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2115,8 +2115,9 @@ int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
return error;
}

-int may_open(struct path *path, int acc_mode, int flag)
+int may_open(struct nameidata *nd, int acc_mode, int flag)
{
+ struct path *path = &nd->path;
struct dentry *dentry = path->dentry;
struct inode *inode = dentry->d_inode;
int error;
@@ -2188,6 +2189,9 @@ int may_open(struct path *path, int acc_mode, int flag)
if (!error)
error = security_path_truncate(path, 0,
ATTR_MTIME|ATTR_CTIME|ATTR_OPEN);
+ /* XXX don't copy up file data */
+ if (is_unionized(path->dentry, path->mnt))
+ error = union_copyup(nd, flag /* XXX not used */);
if (!error) {
vfs_dq_init(inode);

@@ -2234,7 +2238,7 @@ out_unlock:
if (error)
return error;
/* Don't check for write permission, don't truncate */
- return may_open(&nd->path, 0, flag & ~O_TRUNC);
+ return may_open(nd, 0, flag & ~O_TRUNC);
}

/*
@@ -2309,12 +2313,6 @@ struct file *do_filp_open(int dfd, const char *pathname,
&nd, flag);
if (error)
return ERR_PTR(error);
- if (unlikely(flag & FMODE_WRITE)) {
- /* Check for union, etc. in union_copyup */
- error = union_copyup(&nd, flag /* XXX not used */);
- if (error)
- return ERR_PTR(error);
- }
goto ok;
}

@@ -2452,12 +2450,18 @@ ok:
if (error)
goto exit;
}
- error = may_open(&nd.path, acc_mode, flag);
+ error = may_open(&nd, acc_mode, flag);
if (error) {
if (will_write)
mnt_drop_write(nd.path.mnt);
goto exit;
}
+ /* Okay, all permissions go, now copy up */
+ if (!(flag & O_CREAT) && (flag & FMODE_WRITE)) {
+ error = union_copyup(&nd, flag /* XXX not used */);
+ if (error)
+ goto exit;
+ }
filp = nameidata_to_filp(&nd, open_flag);
if (IS_ERR(filp))
ima_counts_put(&nd.path,
diff --git a/fs/nfsctl.c b/fs/nfsctl.c
index 8f9a205..e3b733e 100644
--- a/fs/nfsctl.c
+++ b/fs/nfsctl.c
@@ -38,10 +38,10 @@ static struct file *do_open(char *name, int flags)
return ERR_PTR(error);

if (flags == O_RDWR)
- error = may_open(&nd.path, MAY_READ|MAY_WRITE,
- FMODE_READ|FMODE_WRITE);
+ error = may_open(&nd, MAY_READ|MAY_WRITE,
+ FMODE_READ|FMODE_WRITE);
else
- error = may_open(&nd.path, MAY_WRITE, FMODE_WRITE);
+ error = may_open(&nd, MAY_WRITE, FMODE_WRITE);

if (!error)
return dentry_open(nd.path.dentry, nd.path.mnt, flags,
diff --git a/fs/open.c b/fs/open.c
index 3df5a1b..a1da3a0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,69 +223,69 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
return err;
}

-static long do_sys_truncate(const char __user *pathname, loff_t length)
+static int __do_ftruncate(struct file *file, loff_t length, int small)
{
- struct path path;
- struct inode *inode;
+ struct inode * inode;
+ struct dentry *dentry;
int error;

error = -EINVAL;
- if (length < 0) /* sorry, but loff_t says... */
+ if (length < 0)
goto out;
+ /* explicitly opened as large or we are on 64-bit box */
+ if (file->f_flags & O_LARGEFILE)
+ small = 0;

- error = user_path(pathname, &path);
- if (error)
+ dentry = file->f_path.dentry;
+ inode = dentry->d_inode;
+ error = -EINVAL;
+ if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
goto out;
- inode = path.dentry->d_inode;
-
- /* For directories it's -EISDIR, for other non-regulars - -EINVAL */
- error = -EISDIR;
- if (S_ISDIR(inode->i_mode))
- goto dput_and_out;

error = -EINVAL;
- if (!S_ISREG(inode->i_mode))
- goto dput_and_out;
-
- error = mnt_want_write(path.mnt);
- if (error)
- goto dput_and_out;
+ /* Cannot ftruncate over 2^31 bytes without large file support */
+ if (small && length > MAX_NON_LFS)

- error = inode_permission(inode, MAY_WRITE);
- if (error)
- goto mnt_drop_write_and_out;
+ goto out;

error = -EPERM;
if (IS_APPEND(inode))
- goto mnt_drop_write_and_out;
+ goto out;

- error = get_write_access(inode);
- if (error)
- goto mnt_drop_write_and_out;
+ error = locks_verify_truncate(inode, file, length);
+ if (!error)
+ error = security_path_truncate(&file->f_path, length,
+ ATTR_MTIME|ATTR_CTIME);
+ if (!error)
+ /* Already copied up for union, opened with write */
+ error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
+out:
+ return error;
+}

- /*
- * Make sure that there are no leases. get_write_access() protects
- * against the truncate racing with a lease-granting setlease().
- */
- error = break_lease(inode, FMODE_WRITE);
- if (error)
- goto put_write_and_out;
+static long do_sys_truncate(const char __user *pathname, loff_t length)
+{
+ struct file *file;
+ char *tmp;
+ int error;

- error = locks_verify_truncate(inode, NULL, length);
- if (!error)
- error = security_path_truncate(&path, length, 0);
- if (!error) {
- vfs_dq_init(inode);
- error = do_truncate(path.dentry, length, 0, NULL);
- }
+ error = -EINVAL;
+ if (length < 0) /* sorry, but loff_t says... */
+ return error;

-put_write_and_out:
- put_write_access(inode);
-mnt_drop_write_and_out:
- mnt_drop_write(path.mnt);
-dput_and_out:
- path_put(&path);
-out:
+ tmp = getname(pathname);
+ if (IS_ERR(tmp))
+ return PTR_ERR(tmp);
+
+ file = filp_open(tmp, O_RDWR | O_LARGEFILE, 0);
+ putname(tmp);
+
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ error = __do_ftruncate(file, length, 0);
+
+ fput(file);
return error;
}

@@ -297,46 +297,16 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, unsigned long, length)

static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
{
- struct inode * inode;
- struct dentry *dentry;
struct file * file;
int error;

- error = -EINVAL;
- if (length < 0)
- goto out;
error = -EBADF;
file = fget(fd);
if (!file)
goto out;

- /* explicitly opened as large or we are on 64-bit box */
- if (file->f_flags & O_LARGEFILE)
- small = 0;
+ error = __do_ftruncate(file, length, small);

- dentry = file->f_path.dentry;
- inode = dentry->d_inode;
- error = -EINVAL;
- if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
- goto out_putf;
-
- error = -EINVAL;
- /* Cannot ftruncate over 2^31 bytes without large file support */
- if (small && length > MAX_NON_LFS)
- goto out_putf;
-
- error = -EPERM;
- if (IS_APPEND(inode))
- goto out_putf;
-
- error = locks_verify_truncate(inode, file, length);
- if (!error)
- error = security_path_truncate(&file->f_path, length,
- ATTR_MTIME|ATTR_CTIME);
- if (!error)
- /* Already copied up for union, opened with write */
- error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
-out_putf:
fput(file);
out:
return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 38fb113..8eb0e0e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2134,7 +2134,7 @@ extern void free_write_pipe(struct file *);

extern struct file *do_filp_open(int dfd, const char *pathname,
int open_flag, int mode, int acc_mode);
-extern int may_open(struct path *, int, int);
+extern int may_open(struct nameidata *, int, int);

extern int kernel_read(struct file *, loff_t, char *, unsigned long);
extern struct file * open_exec(const char *);
--
1.6.3.3

2009-10-21 19:22:07

by Valerie Aurora

Subject: [PATCH 41/41] union-mount: Add support for rename by __union_copyup()

From: Jan Blunck <[email protected]>

It is possible to use __union_copyup() to support rename of regular files
without returning -EXDEV.

XXX - Rewrite as copyup to old name followed by rename() + whiteout()

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 350 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 344 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e3e8e98..8419e1e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1842,6 +1842,239 @@ out:
return res;
}

+/**
+ * do_union_hash_lookup() - walk down the union stack and lookup_hash()
+ * @nd: nameidata of parent to lookup from
+ * @name: pathname component to lookup
+ * @path: path to store result of lookup in
+ *
+ * Walk down the union stack and search for single pathname component name. It
+ * is assumed that the caller already did a lookup_hash() in the topmost parent
+ * that gave negative lookup result. Therefore this does call lookup_hash() in
+ * every lower layer (!) of the union stack. If a directory is found the union
+ * stack for that is assembled as well.
+ *
+ * Note:
+ * The caller needs to take care of holding a valid reference to the topmost
+ * parent.
+ * On error we leave @path untouched as well as when we don't find anything.
+ */
+static int do_union_hash_lookup(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path next;
+ int err = 0;
+
+ while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
+ /* rehash because of d_op->d_hash() by the previous layer */
+ name->hash = full_name_hash(name->name, name->len);
+
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ err = lookup_hash(nd, name, &next);
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+
+ if (err)
+ break;
+
+ if (next.dentry->d_inode) {
+ mntget(next.mnt);
+ if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
+ *path = next;
+ break;
+ }
+ err = __hash_lookup_build_union(nd, name, &next);
+ if (err)
+ path_put(&next);
+ else
+ *path = next;
+ break;
+ }
+
+ path_put_conditional(&next, nd);
+
+ if ((IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(next.dentry)) ||
+ d_is_whiteout(next.dentry))
+ break;
+ }
+
+ return err;
+}
+
+/**
+ * _hash_lookup_union() - lookup single pathname component
+ * @nd: nameidata of parent to lookup from
+ * @name: pathname component to lookup
+ * @path: path to store result of lookup in
+ *
+ * Returns the topmost parent locked and the target dentry found in the union
+ * or the topmost negative target dentry otherwise.
+ *
+ * Note:
+ * Returns topmost parent locked even on error.
+ */
+static int _hash_lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ struct path parent = nd->path;
+ struct path topmost;
+ int err;
+
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ err = lookup_hash(nd, name, path);
+ if (err)
+ return err;
+
+ /* if we found something and it isn't a directory, we are done */
+ if (path->dentry->d_inode && !S_ISDIR(path->dentry->d_inode->i_mode))
+ return 0;
+
+ /* stop lookup if the parent directory is marked opaque */
+ if ((IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(path->dentry)) ||
+ d_is_whiteout(path->dentry))
+ return 0;
+
+ if (!strcmp(path->mnt->mnt_sb->s_type->name, "proc") ||
+ !strcmp(path->mnt->mnt_sb->s_type->name, "sysfs"))
+ return 0;
+
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+
+ /*
+ * save a reference to the topmost parent for walking the union stack
+ */
+ path_get(&parent);
+ topmost = *path;
+
+ if (path->dentry->d_inode && S_ISDIR(path->dentry->d_inode->i_mode)) {
+ err = __hash_lookup_build_union(nd, name, path);
+ if (err)
+ goto err_lock_parent;
+ goto out_lock_and_revalidate_parent;
+ }
+
+ err = do_union_hash_lookup(nd, name, path);
+ if (err)
+ goto err_lock_parent;
+
+out_lock_and_revalidate_parent:
+ /* seems that we haven't found anything, so return the topmost */
+ path_to_nameidata(&parent, nd);
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+
+ if (topmost.dentry == path->dentry) {
+ spin_lock(&path->dentry->d_lock);
+ if (nd->path.dentry != path->dentry->d_parent) {
+ spin_unlock(&path->dentry->d_lock);
+ dput(path->dentry);
+ name->hash = full_name_hash(name->name, name->len);
+ err = lookup_hash(nd, name, path);
+ if (err)
+ return err;
+ /* FIXME: What if we find a directory here ... */
+ return err;
+ }
+ spin_unlock(&path->dentry->d_lock);
+ } else
+ dput(topmost.dentry);
+
+ return 0;
+
+err_lock_parent:
+ path_to_nameidata(&parent, nd);
+ path_put_conditional(path, nd);
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ return err;
+}
+
+/**
+ * lookup_rename_source() - lookup the source used by rename
+ *
+ * This is a special version of _hash_lookup_union() which becomes necessary
+ * for finding the source of a rename on union mounts.
+ *
+ * See comment for _hash_lookup_union() above.
+ */
+static int lookup_rename_source(struct nameidata *oldnd,
+ struct nameidata *newnd,
+ struct dentry **trap, struct qstr *name,
+ struct path *old)
+{
+ struct path parent = oldnd->path;
+ struct path topmost;
+ int err;
+
+ err = lookup_hash(oldnd, name, old);
+ if (err)
+ return err;
+
+ /* if we found something and it isn't a directory, we are done */
+ if (old->dentry->d_inode && !S_ISDIR(old->dentry->d_inode->i_mode))
+ return 0;
+
+ /* stop lookup if the parent directory is marked opaque */
+ if ((IS_OPAQUE(oldnd->path.dentry->d_inode) &&
+ !d_is_fallthru(old->dentry)) ||
+ d_is_whiteout(old->dentry))
+ return 0;
+
+ if (!strcmp(old->mnt->mnt_sb->s_type->name, "proc") ||
+ !strcmp(old->mnt->mnt_sb->s_type->name, "sysfs"))
+ return 0;
+
+ unlock_rename(oldnd->path.dentry, newnd->path.dentry);
+
+ /*
+ * save a reference to the topmost parent for walking the union stack
+ */
+ path_get(&parent);
+ topmost = *old;
+
+ if (old->dentry->d_inode && S_ISDIR(old->dentry->d_inode->i_mode)) {
+ err = __hash_lookup_build_union(oldnd, name, old);
+ if (err)
+ goto err_lock;
+ goto out_lock_and_revalidate_parent;
+ }
+
+ err = do_union_hash_lookup(oldnd, name, old);
+ if (err)
+ goto err_lock;
+
+out_lock_and_revalidate_parent:
+ path_to_nameidata(&parent, oldnd);
+ *trap = lock_rename(oldnd->path.dentry, newnd->path.dentry);
+
+ /*
+ * If we return the topmost dentry we have to make sure that it has not
+ * been moved away while we gave up the topmost parents i_mutex lock.
+ */
+ if (topmost.dentry == old->dentry) {
+ spin_lock(&old->dentry->d_lock);
+ if (oldnd->path.dentry != old->dentry->d_parent) {
+ spin_unlock(&old->dentry->d_lock);
+ dput(old->dentry);
+ name->hash = full_name_hash(name->name, name->len);
+ err = lookup_hash(oldnd, name, old);
+ if (err)
+ return err;
+ /* FIXME: What if we find a directory here ... */
+ return err;
+ }
+ spin_unlock(&old->dentry->d_lock);
+ } else
+ dput(topmost.dentry);
+
+ return 0;
+
+err_lock:
+ path_to_nameidata(&parent, oldnd);
+ path_put_conditional(old, oldnd);
+ *trap = lock_rename(oldnd->path.dentry, newnd->path.dentry);
+ return err;
+}
+
static int __lookup_one_len(const char *name, struct qstr *this,
struct dentry *base, int len)
{
@@ -3544,6 +3777,91 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
return error;
}

+static int vfs_rename_union(struct nameidata *oldnd, struct path *old,
+ struct nameidata *newnd, struct path *new)
+{
+ struct inode *old_dir = oldnd->path.dentry->d_inode;
+ struct inode *new_dir = newnd->path.dentry->d_inode;
+ struct qstr old_name;
+ char *name;
+ struct dentry *dentry;
+ int error;
+
+ if (old->dentry->d_inode == new->dentry->d_inode)
+ return 0;
+ error = may_whiteout(old_dir, old->dentry, 0);
+ if (error)
+ return error;
+ if (!old_dir->i_op || !old_dir->i_op->whiteout)
+ return -EPERM;
+
+ if (!new->dentry->d_inode)
+ error = may_create(new_dir, new->dentry);
+ else
+ error = may_delete(new_dir, new->dentry, 0);
+ if (error)
+ return error;
+
+ vfs_dq_init(old_dir);
+ vfs_dq_init(new_dir);
+
+ error = -EBUSY;
+ if (d_mountpoint(old->dentry) || d_mountpoint(new->dentry))
+ return error;
+
+ error = -ENOMEM;
+ name = kmalloc(old->dentry->d_name.len + 1, GFP_KERNEL);
+ if (!name)
+ return error;
+ strncpy(name, old->dentry->d_name.name, old->dentry->d_name.len);
+ name[old->dentry->d_name.len] = 0;
+ old_name.len = old->dentry->d_name.len;
+ old_name.hash = old->dentry->d_name.hash;
+ old_name.name = name;
+
+ /* possibly delete the existing new file */
+ if ((newnd->path.dentry == new->dentry->d_parent) &&
+ new->dentry->d_inode) {
+ /* FIXME: inode may be truncated while we hold a lock */
+ error = vfs_unlink(new_dir, new->dentry);
+ if (error)
+ goto freename;
+
+ dentry = __lookup_hash(&new->dentry->d_name,
+ newnd->path.dentry, newnd);
+ if (IS_ERR(dentry))
+ goto freename;
+
+ dput(new->dentry);
+ new->dentry = dentry;
+ }
+
+ /* copyup to the new file */
+ error = __union_copyup(old, newnd, new);
+ if (error)
+ goto freename;
+
+ /* whiteout the old file */
+ dentry = __lookup_hash(&old_name, oldnd->path.dentry, oldnd);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto freename;
+ error = vfs_whiteout(old_dir, dentry, 0);
+ dput(dentry);
+
+ /* FIXME: This is actually unlink() && create() ... */
+/*
+ if (!error) {
+ const char *new_name = old_dentry->d_name.name;
+ fsnotify_move(old_dir, new_dir, old_name.name, new_name, 0,
+ new_dentry->d_inode, old_dentry->d_inode);
+ }
+*/
+freename:
+ kfree(old_name.name);
+ return error;
+}
+
SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
{
@@ -3582,7 +3900,20 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,

trap = lock_rename(new_dir, old_dir);

- error = hash_lookup_union(&oldnd, &oldnd.last, &old);
+ /*
+ * For union mounts we need to call a giant lookup_rename_source()
+ * instead.
+ * First lock_rename() and look on the topmost fs like you would do in
+ * the normal rename, if you find something which is not a directory,
+ * go ahead and lookup target and do normal rename.
+ * If you find a negative dentry, unlock_rename() and continue as
+ * _hash_lookup_union() would do without locking the topmost parent
+ * at the end. After that do lock_rename() of the source parent and the
+ * target parent and do a copyup with additional whiteout creation at
+ * the end.
+ */
+// error = hash_lookup_union(&oldnd, &oldnd.last, &old);
+ error = lookup_rename_source(&oldnd, &newnd, &trap, &oldnd.last, &old);
if (error)
goto exit3;
/* source must exist */
@@ -3601,19 +3932,21 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
- error = hash_lookup_union(&newnd, &newnd.last, &new);
+ /* target is always on topmost fs, even with unions */
+ error = lookup_hash(&newnd, &newnd.last, &new);
if (error)
goto exit4;
/* target should not be an ancestor of source */
error = -ENOTEMPTY;
if (new.dentry == trap)
goto exit5;
- /* renaming on unions is done by the user-space */
+ /* renaming of directories on unions is done by the user-space */
error = -EXDEV;
- if (is_unionized(oldnd.path.dentry, oldnd.path.mnt))
- goto exit5;
- if (is_unionized(newnd.path.dentry, newnd.path.mnt))
+ if (is_unionized(oldnd.path.dentry, oldnd.path.mnt) &&
+ S_ISDIR(old.dentry->d_inode->i_mode))
goto exit5;
+// if (is_unionized(newnd.path.dentry, newnd.path.mnt))
+// goto exit5;

error = mnt_want_write(oldnd.path.mnt);
if (error)
@@ -3622,6 +3955,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
&newnd.path, new.dentry);
if (error)
goto exit6;
+ if (is_unionized(oldnd.path.dentry, oldnd.path.mnt) &&
+ (old.dentry->d_parent != oldnd.path.dentry)) {
+ error = vfs_rename_union(&oldnd, &old, &newnd, &new);
+ goto exit6;
+ }
error = vfs_rename(old_dir->d_inode, old.dentry,
new_dir->d_inode, new.dentry);
exit6:
--
1.6.3.3
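
The stop conditions used by do_union_hash_lookup() and _hash_lookup_union() in the patch above (a positive entry ends the walk, a whiteout hides everything below, an opaque directory hides everything below unless the entry is a fallthru) can be modelled in a few lines of plain C. This is a toy model only: the types and names (struct layer_dir, union_lookup() and so on) are invented for the sketch, and it ignores locking, rehashing and the building of directory unions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model types; none of these exist in the kernel. */
enum entry_kind { ENT_FILE, ENT_DIR, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry {
	const char *name;
	enum entry_kind kind;
};

struct layer_dir {
	bool opaque;			/* models S_OPAQUE on the directory */
	const struct entry *entries;
	size_t nr_entries;
};

static const struct entry *find_entry(const struct layer_dir *dir,
				      const char *name)
{
	for (size_t i = 0; i < dir->nr_entries; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/*
 * Walk the stack from the topmost layer (index 0) downwards.
 * Returns the index of the layer that provides @name, or -1 if the
 * name does not exist in the union.
 */
static int union_lookup(const struct layer_dir *stack, size_t nr_layers,
			const char *name)
{
	for (size_t i = 0; i < nr_layers; i++) {
		const struct entry *e = find_entry(&stack[i], name);

		/* Positive entry: this layer wins. */
		if (e && (e->kind == ENT_FILE || e->kind == ENT_DIR))
			return (int)i;
		/* Whiteout: the name is deleted, do not look further down. */
		if (e && e->kind == ENT_WHITEOUT)
			return -1;
		/* Opaque directory: only a fallthru lets the walk continue. */
		if (stack[i].opaque && !(e && e->kind == ENT_FALLTHRU))
			return -1;
	}
	return -1;
}
```

With a two-layer stack, a whiteout for "gone" in the top layer makes the lookup fail even though the bottom layer still has the file, while a fallthru in an opaque top directory lets the bottom layer's entry show through.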

2009-10-21 21:18:15

by Andreas Dilger

Subject: Re: [PATCH 15/41] whiteout: ext2 whiteout support

On 2009-10-21, at 13:19, Valerie Aurora wrote:
> This patch adds whiteout support to EXT2. A whiteout is an empty directory
> entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
> allocates space in directories. Due to being implemented as a filetype it is
> necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set.
>
> diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
> index 121720d..bd10826 100644
> --- a/include/linux/ext2_fs.h
> +++ b/include/linux/ext2_fs.h
> @@ -189,6 +189,7 @@ struct ext2_group_desc
> +#define EXT2_OPAQUE_FL 0x00040000

Please check in the upstream e2fsprogs ext2_fs.h before defining new flag
values for ext2/3/4. In this case, 0x40000 conflicts with EXT4_HUGE_FILE_FL,
which is of course bad.


> @@ -503,10 +504,12 @@ struct ext2_super_block {
> #define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
> +#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020

This one doesn't conflict, probably due to luck, because 0x0040-0x0200 are
already in use for other features. I'm not sure if 0x0020 was reserved for
some other use, or just skipped to avoid potential conflicts.
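
A check of this kind is easy to mechanise. The following is only a sketch of the idea, not anything from e2fsprogs: struct flag_def and find_flag_conflict() are hypothetical names, and the table of known flags would have to be filled in from the upstream ext2_fs.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one already-assigned flag or feature bit. */
struct flag_def {
	const char *name;
	uint32_t value;
};

/*
 * Returns the index of the first known flag that shares a bit with
 * @candidate, or -1 if the candidate value is free in this table.
 */
static int find_flag_conflict(const struct flag_def *known, size_t nr,
			      uint32_t candidate)
{
	for (size_t i = 0; i < nr; i++)
		if (known[i].value & candidate)
			return (int)i;
	return -1;
}
```

Fed with EXT4_HUGE_FILE_FL (0x40000) as a known inode flag, the proposed EXT2_OPAQUE_FL value 0x00040000 is reported as a conflict; 0x0020 against EXT2_FEATURE_INCOMPAT_META_BG (0x0010) is not.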


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-10-21 22:51:18

by David Woodhouse

Subject: Re: [PATCH 16/41] whiteout: jffs2 whiteout support

On Wed, 2009-10-21 at 12:19 -0700, Valerie Aurora wrote:
> From: Felix Fietkau <[email protected]>
>
> Add support for whiteout dentries to jffs2.

As discussed, there are a few places where JFFS2 will assume that a
dirent with fd->ino == 0 is a deletion dirent -- a kind of whiteout of
its own, used internally because it's a log-structured file system and
it needs to mark previously existing dirents as having been unlinked.

You're breaking that assumption. So, for example, your whiteouts are
going to get lost when the eraseblock containing them is garbage
collected -- because they'll be treated like deletion dirents, which
only need to remain on the medium for as long as the _real_ dirents
which they exist to kill.

This completely untested patch addresses some of it.

The other thing to verify is the three places in dir.c which check
whether whiteout/rmdir/rename should return -ENOTEMPTY. Those all do so
by checking whether the directory in question has any dirents with
fd->ino != 0 -- i.e. does it contain any _real_ dirents, or only the
deletion markers for dead stuff.

So that will now be _allowing_ you to remove a directory which contains
whiteouts, since you haven't changed the test. Is that intentional? It
seems sane at first glance.
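
To make the distinction concrete, here is a small standalone sketch of the three cases (real dirent, whiteout, deletion dirent) and of the unchanged emptiness test. It is a simplified model, not JFFS2 code: struct sketch_dirent and the helper names are invented, and SKETCH_DT_WHT is a local stand-in using the same value (14) as DT_WHT on Linux.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DT_WHT 14	/* same value as DT_WHT on Linux */

/* Hypothetical, cut-down stand-in for a JFFS2 full dirent. */
struct sketch_dirent {
	uint32_t ino;
	uint8_t type;
};

enum dirent_class { DIRENT_REAL, DIRENT_WHITEOUT, DIRENT_DELETION };

static enum dirent_class classify_dirent(const struct sketch_dirent *fd)
{
	if (fd->ino)
		return DIRENT_REAL;
	/* ino == 0 alone no longer means "deletion dirent". */
	return fd->type == SKETCH_DT_WHT ? DIRENT_WHITEOUT : DIRENT_DELETION;
}

/* GC should copy real dirents and whiteouts forward as live data. */
static bool gc_treats_as_live(const struct sketch_dirent *fd)
{
	return classify_dirent(fd) != DIRENT_DELETION;
}

/*
 * The -ENOTEMPTY test as it stands: only dirents with ino != 0 count,
 * so a directory holding nothing but whiteouts is still removable.
 */
static bool dir_is_removable(const struct sketch_dirent *fds, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (fds[i].ino)
			return false;
	return true;
}
```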

diff --git a/fs/jffs2/build.c b/fs/jffs2/build.c
index c5e1450..4dc883f 100644
--- a/fs/jffs2/build.c
+++ b/fs/jffs2/build.c
@@ -217,8 +217,9 @@ static void jffs2_build_remove_unlinked_inode(struct jffs2_sb_info *c,
ic->scan_dents = fd->next;

if (!fd->ino) {
- /* It's a deletion dirent. Ignore it */
- dbg_fsbuild("child \"%s\" is a deletion dirent, skipping...\n", fd->name);
+ dbg_fsbuild("child \"%s\" is a %s, skipping...\n",
+ fd->name,
+ (fd->type == DT_WHT)?"whiteout":"deletion dirent");
jffs2_free_full_dirent(fd);
continue;
}
diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
index 090c556..7f5afbb 100644
--- a/fs/jffs2/gc.c
+++ b/fs/jffs2/gc.c
@@ -516,7 +516,7 @@ static int jffs2_garbage_collect_live(struct jffs2_sb_info *c, struct jffs2_era
break;
}

- if (fd && fd->ino) {
+ if (fd && (fd->ino || fd->type == DT_WHT)) {
ret = jffs2_garbage_collect_dirent(c, jeb, f, fd);
} else if (fd) {
ret = jffs2_garbage_collect_deletion_dirent(c, jeb, f, fd);
@@ -895,7 +895,7 @@ static int jffs2_garbage_collect_deletion_dirent(struct jffs2_sb_info *c, struct
continue;

/* If the name length doesn't match, or it's another deletion dirent, skip */
- if (rd->nsize != name_len || !je32_to_cpu(rd->ino))
+ if (rd->nsize != name_len || (!je32_to_cpu(rd->ino) && rd->type != DT_WHT))
continue;

/* OK, check the actual name now */
diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c
index ca29440..bcd4b86 100644
--- a/fs/jffs2/write.c
+++ b/fs/jffs2/write.c
@@ -629,8 +629,9 @@ int jffs2_do_unlink(struct jffs2_sb_info *c, struct jffs2_inode_info *dir_f,
printk(KERN_WARNING "Deleting inode #%u with active dentry \"%s\"->ino #%u\n",
dead_f->inocache->ino, fd->name, fd->ino);
} else {
- D1(printk(KERN_DEBUG "Removing deletion dirent for \"%s\" from dir ino #%u\n",
- fd->name, dead_f->inocache->ino));
+ D1(printk(KERN_DEBUG "Removing %s for \"%s\" from dir ino #%u\n",
+ (fd->type == DT_WHT)?"whiteout":"deletion dirent",
+ fd->name, dead_f->inocache->ino));
}
if (fd->raw)
jffs2_mark_node_obsolete(c, fd->raw);


--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-10-22 02:45:31

by J. R. Okajima

Subject: Re: [RFC PATCH 00/40] Writable overlays (union mounts)


Hi,

Valerie Aurora:
> Here is the current patch set for writable overlays (union mounts).
> It needs lots of review! Especially the bits where we do nasty things
> with readdir().
>
> Writable overlays let you mount one read-write file system
> transparently over another read-only file system. This is useful for
> things like LiveCDs. Detailed documentation and HOWTO here:

Have the issues I pointed out been addressed?

========================================
> ----------------------------------------------------------------------
> I believe 'fallthru' in UnionMount is a good idea. But I am afraid it
> may consume too much memory, particularly when the upper layer is tmpfs.
> While one fallthru entry is small, a recent LiveCD contains a great many
> files in squashfs, and its size has grown to that of a DVD. If users try
> 'find /', then many fallthru entries will be created, and I am afraid
> that causes memory pressure.
> What do you think about that?
> ----------------------------------------------------------------------
> I am afraid this issue may not be solved soon. It should be listed in a
> longer term todo list, or no action to be taken (this is a feature).

Hm. The fallthru entries are only essential when it comes to
directories with mixed top/bottom entries during a readdir(). I can
think of some ways to make fallthrus less common, or to be able to
throw them out. I will keep this in mind, thanks!

-VAL

========================================
> - link(2) doesn't work
> When the source file exists on the lower, it returns "Invalid
> cross-device link" error.
> - Is it an expected behaviour?
> If UnionMount behaves as an ordinary filesystem, link(2) should work.
> But UnionMount is not a filesystem actually. So to return the error
> may be correct. I am not sure which is true.
>
> Do I make myself clear?

Yes, I understand now. This comes back to the same userland problem
as rename(); technically userland should support fallback for this,
but many apps assume it can't happen in the same directory. I think
we could make this work without copying up the file if we make a
fallthru for the target.

In general, it might be good to have a config or mount option to
enable/disable the EXDEV returns, and printk something when the
workaround is triggered. This would give us a migration path to a
future in which userland utilities can deal with EXDEV in the same
directory.

Both are on my todo list.

-VAL

========================================
> I might find a minor issue about copyup and read(2).
> When two processes open the same file, with O_RDONLY and O_WRONLY
> individually. One of them issues read(2), and the other issues write(2)
> at the same time.
>
> ProcessA
> - open(O_RDONLY)
> - read
>
> ProcessB
> - open(O_WRONLY)
> - write
>
> If read(2) executes before write(2), ProcessA gets the correct latest
> (at that point) filedata. But if write(2) by ProcessB executes first,
> the filedata ProcessA got may be obsoleted since it still refers to the
> file on the lower readonly fs.
> Users may not be aware since it is hard to know whether write(2) was
> executed first, and this issue may be minor.
>
> This scenario can happen in a single process.
>
> ProcessC
> - open(O_RDONLY)
> - open(O_WRONLY)
> - write
> - read
>
> This is not a race condition actually, but ProcessC will get the
> obsoleted filedata. It will not get the filedata which it just wrote.
> While I don't think there exists such application :-), users may think
> it a problem.

I see what you mean!

I guess you can view it as effectively a rename() over the old file -
it's the same as if you instead created a new file, copied all the
data into it, and then renamed it over the old file. Which is a very
common method of updating files.

It will indeed be interesting to see if any applications break as a
result of this. Hopefully not, all the solutions I can think of are
quite terrible.

-VAL
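
The rename() comparison can be tried on any ordinary file system, without union mounts. The sketch below opens a file read-only, replaces it by renaming a freshly written file over it, and then reads through the old descriptor, which still sees the old contents. The function name and parameters are made up for the illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open @path read-only, replace it by renaming @tmp_path (filled with
 * @new_data) over it, then read through the descriptor that was opened
 * before the replacement. Returns the number of bytes read into @buf,
 * or -1 on error. Hypothetical helper, for illustration only.
 */
static ssize_t read_after_replace(const char *path, const char *tmp_path,
				  const char *new_data, char *buf, size_t len)
{
	ssize_t n = -1;
	int rfd, wfd;

	rfd = open(path, O_RDONLY);
	if (rfd < 0)
		return -1;

	wfd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (wfd < 0)
		goto out;
	if (write(wfd, new_data, strlen(new_data)) < 0) {
		close(wfd);
		goto out;
	}
	close(wfd);

	/* From here on, new opens of @path see the new file. */
	if (rename(tmp_path, path) < 0)
		goto out;

	/* The old descriptor still refers to the replaced inode. */
	n = read(rfd, buf, len);
out:
	close(rfd);
	return n;
}
```

A reader holding the old descriptor gets the old data, exactly as ProcessA or ProcessC would keep reading the lower-layer file after a copyup.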

========================================

I just want to confirm (and do not mean to push you).


J. R. Okajima

2009-10-27 02:15:02

by Valerie Aurora

Subject: Re: [PATCH 15/41] whiteout: ext2 whiteout support

On Wed, Oct 21, 2009 at 03:17:38PM -0600, Andreas Dilger wrote:
> On 2009-10-21, at 13:19, Valerie Aurora wrote:
> >This patch adds whiteout support to EXT2. A whiteout is an empty directory
> >entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
> >allocates space in directories. Due to being implemented as a filetype it is
> >necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set.
> >
> >diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
> >index 121720d..bd10826 100644
> >--- a/include/linux/ext2_fs.h
> >+++ b/include/linux/ext2_fs.h
> >@@ -189,6 +189,7 @@ struct ext2_group_desc
> >+#define EXT2_OPAQUE_FL 0x00040000
>
> Please check in the upstream e2fsprogs ext2_fs.h before defining new flag
> values for ext2/3/4. In this case, 0x40000 conflicts with EXT4_HUGE_FILE_FL,
> which is of course bad.
>
>
> >@@ -503,10 +504,12 @@ struct ext2_super_block {
> >#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
> >+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
>
> This one doesn't conflict, probably due to luck, because 0x0040-0x0200 are
> already in use for other features. I'm not sure if 0x0020 was reserved for
> some other use, or just skipped to avoid potential conflicts.

Thanks for reviewing! I'll fix that in the next rev.

-VAL

2009-10-27 02:21:21

by Valerie Aurora

Subject: Re: [PATCH 16/41] whiteout: jffs2 whiteout support

On Thu, Oct 22, 2009 at 07:50:49AM +0900, David Woodhouse wrote:
> On Wed, 2009-10-21 at 12:19 -0700, Valerie Aurora wrote:
> > From: Felix Fietkau <[email protected]>
> >
> > Add support for whiteout dentries to jffs2.
>
> As discussed, there are a few places where JFFS2 will assume that a
> dirent with fd->ino == 0 is a deletion dirent -- a kind of whiteout of
> its own, used internally because it's a log-structured file system and
> it needs to mark previously existing dirents as having been unlinked.
>
> You're breaking that assumption. So, for example, your whiteouts are
> going to get lost when the eraseblock containing them is garbage
> collected -- because they'll be treated like deletion dirents, which
> only need to remain on the medium for as long as the _real_ dirents
> which they exist to kill.
>
> This completely untested patch addresses some of it.

I think you are right. Thanks! I will add JFFS2 to my test suite
before the next release. Right now I am testing mostly on UML, which
doesn't support the RAM-based MTD emulator as far as I can tell.

> The other thing to verify is the three places in dir.c which check
> whether whiteout/rmdir/rename should return -ENOTEMPTY. Those all do so
> by checking whether the directory in question has any dirents with
> fd->ino != 0 -- i.e. does it contain any _real_ dirents, or only the
> deletion markers for dead stuff.
>
> So that will now be _allowing_ you to remove a directory which contains
> whiteouts, since you haven't changed the test. Is that intentional? It
> seems sane at first glance.

Yes, you should be able to remove a directory which contains only
union mount-level whiteouts.

-VAL
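
[Editor's sketch] The distinction discussed above can be summarised as a three-way classification: a dirent with a non-zero inode number is real, and a dirent with ino == 0 is either a JFFS2-internal deletion dirent or, when its type is DT_WHT, a union-mount whiteout. The sketch below is a userspace model under that reading. struct sketch_dirent and the helper names are hypothetical, and only the two fields used in the quoted hunks are modelled.

```c
#include <stdint.h>

#define SKETCH_DT_WHT 14	/* same numeric value as DT_WHT in <dirent.h> */

enum dirent_kind { DIRENT_REAL, DIRENT_DELETION, DIRENT_WHITEOUT };

/* Hypothetical stand-in for the fields of a JFFS2 dirent used in the thread. */
struct sketch_dirent {
	uint32_t ino;
	uint8_t type;
};

static enum dirent_kind classify(const struct sketch_dirent *fd)
{
	if (fd->ino)
		return DIRENT_REAL;
	return fd->type == SKETCH_DT_WHT ? DIRENT_WHITEOUT : DIRENT_DELETION;
}

/*
 * Garbage collection must carry real dirents and whiteouts forward;
 * only deletion dirents may be dropped once their target is gone.
 */
static int must_preserve(const struct sketch_dirent *fd)
{
	return classify(fd) != DIRENT_DELETION;
}

/* Only real dirents make a directory non-empty for rmdir purposes. */
static int blocks_rmdir(const struct sketch_dirent *fd)
{
	return classify(fd) == DIRENT_REAL;
}
```

Under this model a directory holding only whiteouts can be removed, which matches the answer given above, while the whiteouts themselves survive garbage collection.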

> diff --git a/fs/jffs2/build.c b/fs/jffs2/build.c
> index c5e1450..4dc883f 100644
> --- a/fs/jffs2/build.c
> +++ b/fs/jffs2/build.c
> @@ -217,8 +217,9 @@ static void jffs2_build_remove_unlinked_inode(struct jffs2_sb_info *c,
> ic->scan_dents = fd->next;
>
> if (!fd->ino) {
> - /* It's a deletion dirent. Ignore it */
> - dbg_fsbuild("child \"%s\" is a deletion dirent, skipping...\n", fd->name);
> + dbg_fsbuild("child \"%s\" is a %s, skipping...\n",
> + fd->name,
> + (fd->type == DT_WHT)?"whiteout":"deletion dirent");
> jffs2_free_full_dirent(fd);
> continue;
> }
> diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
> index 090c556..7f5afbb 100644
> --- a/fs/jffs2/gc.c
> +++ b/fs/jffs2/gc.c
> @@ -516,7 +516,7 @@ static int jffs2_garbage_collect_live(struct jffs2_sb_info *c, struct jffs2_era
> break;
> }
>
> - if (fd && fd->ino) {
> + if (fd && (fd->ino || fd->type == DT_WHT)) {
> ret = jffs2_garbage_collect_dirent(c, jeb, f, fd);
> } else if (fd) {
> ret = jffs2_garbage_collect_deletion_dirent(c, jeb, f, fd);
> @@ -895,7 +895,7 @@ static int jffs2_garbage_collect_deletion_dirent(struct jffs2_sb_info *c, struct
> continue;
>
> /* If the name length doesn't match, or it's another deletion dirent, skip */
> - if (rd->nsize != name_len || !je32_to_cpu(rd->ino))
> + if (rd->nsize != name_len || (!je32_to_cpu(rd->ino) && rd->type != DT_WHT))
> continue;
>
> /* OK, check the actual name now */
> diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c
> index ca29440..bcd4b86 100644
> --- a/fs/jffs2/write.c
> +++ b/fs/jffs2/write.c
> @@ -629,8 +629,9 @@ int jffs2_do_unlink(struct jffs2_sb_info *c, struct jffs2_inode_info *dir_f,
> printk(KERN_WARNING "Deleting inode #%u with active dentry \"%s\"->ino #%u\n",
> dead_f->inocache->ino, fd->name, fd->ino);
> } else {
> - D1(printk(KERN_DEBUG "Removing deletion dirent for \"%s\" from dir ino #%u\n",
> - fd->name, dead_f->inocache->ino));
> + D1(printk(KERN_DEBUG "Removing %s for \"%s\" from dir ino #%u\n",
> + (fd->type == DT_WHT)?"whiteout":"deletion dirent",
> + fd->name, dead_f->inocache->ino));
> }
> if (fd->raw)
> jffs2_mark_node_obsolete(c, fd->raw);
>
>
> --
> David Woodhouse Open Source Technology Centre
> [email protected] Intel Corporation
>

2009-10-27 02:23:21

by Valerie Aurora

Subject: Re: [RFC PATCH 00/40] Writable overlays (union mounts)

On Thu, Oct 22, 2009 at 11:44:40AM +0900, [email protected] wrote:
>
> Hi,
>
> Valerie Aurora:
> > Here is the current patch set for writable overlays (union mounts).
> > It needs lots of review! Especially the bits where we do nasty things
> > with readdir().
> >
> > Writable overlays let you mount one read-write file system
> > transparently over another read-only file system. This is useful for
> > things like LiveCDs. Detailed documentation and HOWTO here:
>
> Have the issues I pointed out been addressed?

Not in this release, no. I just wanted to get something out there for
review. rename() is particularly high on my list. Thank you for
keeping track!

-VAL

> ========================================
> > ----------------------------------------------------------------------
> > I believe 'fallthru' in UnionMount is a good idea. But I am afraid it
> > may consume too much memory, particularly when the upper layer is tmpfs.
> > While one fallthru entry is small, a recent LiveCD contains very many
> > files in squashfs and its size is growing toward that of a DVD. If users
> > try 'find /', then many fallthru entries will be created and I am afraid
> > this causes memory pressure.
> > What do you think about that?
> > ----------------------------------------------------------------------
> > I am afraid this issue may not be solved soon. It should be listed in a
> > longer term todo list, or no action to be taken (this is a feature).
>
> Hm. The fallthru entries are only essential when it comes to
> directories with mixed top/bottom entries during a readdir(). I can
> think of some ways to make fallthrus less common, or to be able to
> throw them out. I will keep this in mind, thanks!
>
> -VAL
>
> ========================================
> > - link(2) doesn't work
> > When the source file exists on the lower, it returns "Invalid
> > cross-device link" error.
> > - Is it an expected behaviour?
> > If UnionMount behaves as an ordinary filesystem, link(2) should work.
> > But UnionMount is not a filesystem actually. So to return the error
> > may be correct. I am not sure which is true.
> >
> > Do I make myself clear?
>
> Yes, I understand now. This comes back to the same userland problem
> as rename(); technically userland should support fallback for this,
> but many apps assume it can't happen in the same directory. I think
> we could make this work without copying up the file if we make a
> fallthru for the target.
>
> In general, it might be good to have a config or mount option to
> enable/disable the EXDEV returns, and printk something when the
> workaround is triggered. This would give us a migration path to a
> future in which userland utilities can deal with EXDEV in the same
> directory.
>
> Both are on my todo list.
>
> -VAL
>
> ========================================
> > I might find a minor issue about copyup and read(2).
> > When two processes open the same file, with O_RDONLY and O_WRONLY
> > individually. One of them issues read(2), and the other issues write(2)
> > at the same time.
> >
> > ProcessA
> > - open(O_RDONLY)
> > - read
> >
> > ProcessB
> > - open(O_WRONLY)
> > - write
> >
> > If read(2) executes before write(2), ProcessA gets the correct latest
> > (at that point) filedata. But if write(2) by ProcessB executes first,
> > the filedata ProcessA got may be obsoleted since it still refers to the
> > file on the lower readonly fs.
> > Users may not be aware since it is hard to know whether write(2) was
> > executed first, and this issue may be minor.
> >
> > This scenario can happen in a single process.
> >
> > ProcessC
> > - open(O_RDONLY)
> > - open(O_WRONLY)
> > - write
> > - read
> >
> > This is not a race condition actually, but ProcessC will get the
> > obsoleted filedata. It will not get the filedata which it just wrote.
> > While I don't think there exists such application :-), users may think
> > it a problem.
>
> I see what you mean!
>
> I guess you can view it as effectively a rename() over the old file -
> it's the same as if you instead created a new file, copied all the
> data into it, and then renamed it over the old file. Which is a very
> common method of updating files.
>
> It will indeed be interesting to see if any applications break as a
> result of this. Hopefully not, all the solutions I can think of are
> quite terrible.
>
> -VAL
>
> ========================================
>
> I just want to confirm (and never mean to push you).
>
>
> J. R. Okajima
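
[Editor's sketch] The copy-up case above is compared to renaming a new file over an old one while a reader still holds the old file open. That analogy can be tried on any POSIX system without union mounts. The sketch below is a plain userspace program fragment, not union-mount code; write_file() and read_after_replace() are hypothetical helper names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write data to path, replacing previous contents. Returns 0 on success. */
static int write_file(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	ssize_t len = (ssize_t)strlen(data);
	int ok;

	if (fd < 0)
		return -1;
	ok = (write(fd, data, (size_t)len) == len);
	return (close(fd) == 0 && ok) ? 0 : -1;
}

/*
 * Open path read-only, replace it by renaming a new file over it, then
 * read through the descriptor opened before the replacement. The bytes
 * read are copied into buf and NUL-terminated. Returns the byte count
 * or -1 on error.
 */
static ssize_t read_after_replace(const char *path, const char *tmp_path,
				  const char *new_data, char *buf,
				  size_t buflen)
{
	ssize_t n;
	int rfd = open(path, O_RDONLY);

	if (rfd < 0)
		return -1;
	if (write_file(tmp_path, new_data) != 0 ||
	    rename(tmp_path, path) != 0) {
		close(rfd);
		return -1;
	}
	n = read(rfd, buf, buflen - 1);
	if (n >= 0)
		buf[n] = '\0';
	close(rfd);
	return n;
}
```

The reader sees the old contents because its descriptor still refers to the old inode. That is the same observable behaviour as the ProcessC scenario, where the read-only descriptor keeps pointing at the lower-layer file after the copy-up.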

2009-10-27 14:36:17

by Eric Paris

Subject: Re: [PATCH 10/41] whiteout: Add vfs_whiteout() and whiteout inode operation

On Wed, Oct 21, 2009 at 3:19 PM, Valerie Aurora <[email protected]> wrote:
> From: Jan Blunck <[email protected]>
>
> Simply white-out a given directory entry. This functionality is usually used
> in the sense of unlink. Therefore the given dentry can still be in-use and
> contains an in-use inode. The filesystems inode operation has to do what
> unlink or rmdir would in that case. Since the dentry still might be in-use
> we have to provide a fresh unhashed dentry that is used as the whiteout
> dentry instead. The given dentry is dropped and the whiteout dentry is
> rehashed instead.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
>  fs/dcache.c            |    4 +-
>  fs/namei.c             |  104 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dcache.h |    6 +++
>  include/linux/fs.h     |    3 +
>  4 files changed, 116 insertions(+), 1 deletions(-)
>

> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 30b93b2..7648b49 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -183,6 +183,7 @@ d_iput:             no              no              no       yes
>  #define DCACHE_INOTIFY_PARENT_WATCHED  0x0020 /* Parent inode is watched by inotify */
>
>  #define DCACHE_COOKIE          0x0040  /* For use by dcookie subsystem */
> +#define DCACHE_WHITEOUT                0x0080  /* This negative dentry is a whiteout */
>
>  #define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */
>

I don't think you want 2 flags with the 0x0080 value....... This
can't be right.

2009-10-27 21:22:31

by Valerie Aurora

Subject: Re: [PATCH 10/41] whiteout: Add vfs_whiteout() and whiteout inode operation

On Tue, Oct 27, 2009 at 10:36:18AM -0400, Eric Paris wrote:
> On Wed, Oct 21, 2009 at 3:19 PM, Valerie Aurora <[email protected]> wrote:
> > From: Jan Blunck <[email protected]>
> >
> > Simply white-out a given directory entry. This functionality is usually used
> > in the sense of unlink. Therefore the given dentry can still be in-use and
> > contains an in-use inode. The filesystems inode operation has to do what
> > unlink or rmdir would in that case. Since the dentry still might be in-use
> > we have to provide a fresh unhashed dentry that is used as the whiteout
> > dentry instead. The given dentry is dropped and the whiteout dentry is
> > rehashed instead.
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: David Woodhouse <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> >  fs/dcache.c            |    4 +-
> >  fs/namei.c             |  104 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dcache.h |    6 +++
> >  include/linux/fs.h     |    3 +
> >  4 files changed, 116 insertions(+), 1 deletions(-)
> >
>
> > diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> > index 30b93b2..7648b49 100644
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -183,6 +183,7 @@ d_iput:             no              no              no       yes
> >  #define DCACHE_INOTIFY_PARENT_WATCHED  0x0020 /* Parent inode is watched by inotify */
> >
> >  #define DCACHE_COOKIE          0x0040  /* For use by dcookie subsystem */
> > +#define DCACHE_WHITEOUT                0x0080  /* This negative dentry is a whiteout */
> >
> >  #define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */
> >
>
> I don't think you want 2 flags with the 0x0080 value....... This
> can't be right.

This looks like a merge error I introduced during a rebase. Thanks
for catching it!

-VAL

2009-11-30 02:08:40

by Erez Zadok

Subject: Re: [PATCH 01/41] VFS: BUG() if somebody tries to rehash an already hashed dentry

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Break early when somebody tries to rehash an already hashed dentry.
> Otherwise this leads to interesting corruptions in the dcache hash table
> later on.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 9e5cd3c..38bf982 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1550,6 +1550,7 @@ void d_rehash(struct dentry * entry)
> {
> spin_lock(&dcache_lock);
> spin_lock(&entry->d_lock);
> + BUG_ON(!d_unhashed(entry));
> _d_rehash(entry);
> spin_unlock(&entry->d_lock);
> spin_unlock(&dcache_lock);

This patch seems unrelated to union mounts. If so, can you get it pushed
upstream sooner? Or is this a debugging patch useful only when developing
union mounts?

You also said that it can lead to "interesting corruptions". What kind of
corruptions exactly? Also, would it make more sense to allow _d_rehash() to
hash in an unhashed dentry for the first time?

Erez.

PS. apologies for the belated review. I need a thanksgiving break once a
month to catch up to emails. :-)
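
[Editor's sketch] One way to picture the kind of corruption the patch guards against is a generic circular doubly linked list: inserting a node that is already linked overwrites its own pointers. The sketch below is a toy model and not the dcache hash table; struct node and the helpers are hypothetical, and assert() stands in for BUG_ON(!d_unhashed(entry)).

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

/* An unlinked node has NULL pointers in this toy model. */
static void node_init(struct node *n)
{
	n->next = n->prev = NULL;
}

static int node_unhashed(const struct node *n)
{
	return n->next == NULL;
}

/* Insert n right after head, refusing nodes that are already linked. */
static void list_add_checked(struct node *head, struct node *n)
{
	assert(node_unhashed(n));	/* plays the role of the BUG_ON() */
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static size_t list_len(const struct node *head)
{
	size_t len = 0;

	for (const struct node *p = head->next; p != head; p = p->next)
		len++;
	return len;
}
```

In this toy list, re-adding the node that is currently first would set its next pointer to itself, so any later walk such as list_len() would never return to the head. Failing early at the insert makes the caller visible instead of the much later victim.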

2009-11-30 02:02:59

by Erez Zadok

Subject: Re: [PATCH 03/41] VFS: Make lookup_hash() return a struct path

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> This patch changes lookup_hash() into returning a struct path.

Actually, lookup_hash now also takes a qstr.

This is a somewhat involved patch. I think more documentation is needed to
list all the places it touches and changes, b/c now struct path has to
propagate in various other places. (In general, passing struct path instead
of struct dentry is going in the right direction: eventually we could get rid
of lookup_one_len.)

> @@ -1219,14 +1219,22 @@ out:
> * needs parent already locked. Doesn't follow mounts.
> * SMP-safe.
> */
> -static struct dentry *lookup_hash(struct nameidata *nd)
> +static int lookup_hash(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> {

I suggest you document above this function what the @name and @path are for,
who is supposed to allocate and free them, caller/callee's responsibilities,
side effects (if any), new return status upon success/failure, etc.

>
> err = inode_permission(nd->path.dentry->d_inode, MAY_EXEC);
> if (err)
> - return ERR_PTR(err);
> - return __lookup_hash(&nd->last, nd->path.dentry, nd);
> + return err;

At least initially, while all this code is being developed, it might also be
a good idea to add

BUG_ON(!name);
BUG_ON(!path);

here and possibly in other places which are now taking new pointers.

Erez.
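
[Editor's sketch] The calling convention being asked for (an int status plus a caller-provided result, documented above the function) might look roughly like the following in a userspace model. Everything here is hypothetical: the sketch_* types are stand-ins for struct path and friends, and the toy lookup fails with -ENOENT on a miss, which is an assumption of this sketch and not a statement about what the real lookup_hash() returns.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct dentry, struct vfsmount, struct path. */
struct sketch_dentry { const char *name; };
struct sketch_mount { int id; };
struct sketch_path {
	struct sketch_mount *mnt;
	struct sketch_dentry *dentry;
};

static struct sketch_mount the_mount = { 1 };
static struct sketch_dentry known = { "etc" };

/**
 * sketch_lookup - look up @name and fill in @path
 * @name: component to look up; owned by the caller, never modified
 * @path: caller-allocated result; written only on success
 *
 * Returns 0 on success or a negative errno. On failure @path is left
 * untouched, so the caller has nothing to release.
 */
static int sketch_lookup(const char *name, struct sketch_path *path)
{
	if (!name || !path)
		return -EINVAL;	/* the kernel version would BUG_ON() here */
	if (strcmp(name, known.name) != 0)
		return -ENOENT;
	path->mnt = &the_mount;
	path->dentry = &known;
	return 0;
}
```

Compared with returning a pointer or ERR_PTR(), this form lets one call hand back both the mount and the dentry, at the cost of making ownership of the out-parameter something the comment has to spell out.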

2009-11-30 02:08:09

by Erez Zadok

Subject: Re: [PATCH 04/41] VFS: Remove unnecessary micro-optimization in cached_lookup()

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> d_lookup() takes rename_lock which is a seq_lock. This is so cheap
> it's not worth calling lockless __d_lookup() first from
> cache_lookup(). Rename cached_lookup() to cache_lookup() while we're
> there.

Val, this is another patch unrelated to union mounts, an
optimization/simplification of the VFS code. I think you need to try and
push such VFS patches upstream more quickly, so as to reduce the set of UM
patches you have to maintain.

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 13 ++++---------
> 1 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e334f25..9c9ecfa 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -404,15 +404,10 @@ do_revalidate(struct dentry *dentry, struct nameidata *nd)
> * Internal lookup() using the new generic dcache.
> * SMP-safe
> */
> -static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
> +static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
> + struct nameidata *nd)
> {
> - struct dentry * dentry = __d_lookup(parent, name);
> -
> - /* lockess __d_lookup may fail due to concurrent d_move()
> - * in some unrelated directory, so try with d_lookup
> - */
> - if (!dentry)
> - dentry = d_lookup(parent, name);
> + struct dentry *dentry = d_lookup(parent, name);
>
> if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
> dentry = do_revalidate(dentry, nd);
> @@ -1191,7 +1186,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
> goto out;
> }
>
> - dentry = cached_lookup(base, name, nd);
> + dentry = cache_lookup(base, name, nd);
> if (!dentry) {
> struct dentry *new;
>
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Erez.

2009-11-30 02:11:39

by Erez Zadok

Subject: Re: [PATCH 05/41] VFS: Make real_lookup() return a struct path

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> This patch changes real_lookup() into returning a struct path.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 82 +++++++++++++++++++++++++++++++++++++----------------------
> 1 files changed, 51 insertions(+), 31 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 9c9ecfa..a338496 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -462,10 +462,11 @@ ok:
> * make sure that nobody added the entry to the dcache in the meantime..
> * SMP-safe
> */
> -static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
> +static int real_lookup(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> {

Same comments I had on patch 3:

- document in comment and patch header the new @path parameter, who is
responsible for it, new return err, etc.

- consider adding BUG_ON(!path)

- perhaps this VFS change should also be pushed upstream before UM

Erez.

2009-11-30 02:29:14

by Erez Zadok

Subject: Re: [PATCH 06/41] VFS: Introduce dput() variant that maintains a kill-list

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> This patch introduces a new variant of dput(). This becomes necessary to
> prevent a recursive call to dput() from the union mount code.
>
> void __dput(struct dentry *dentry, struct list_head *list, int greedy);
> struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
> int greedy);
>
> __dput() works mostly like the original dput() did. The main difference is
> that if the greedy argument is zero it will put the parent on a special
> list instead of trying to get rid of it directly.
>
> Therefore the union mount code can safely call __dput() when it wants to get
> rid of underlying dentry references during a dput(). After calling __dput()
> or __d_kill() the caller must make sure that __d_kill_final() is called on all
> dentries on the kill list. __d_kill_final() is actually doing the
> dentry_iput() and is also dereferencing the parent.

From the description above, there is something somewhat unclean about all
the special things that now have to happen: a special flag to affect how a
function behaves, an extra requirement on the caller of __d_kill, etc. I
wonder if there is a cleaner way to achieve this.

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 105 insertions(+), 10 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 38bf982..3415e9e 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -157,14 +157,19 @@ static void dentry_lru_del_init(struct dentry *dentry)
> }
>
> /**
> - * d_kill - kill dentry and return parent
> + * __d_kill - kill dentry and return parent
> * @dentry: dentry to kill
> + * @list: kill list
> + * @greedy: return parent instead of putting it on the kill list
> *
> * The dentry must already be unhashed and removed from the LRU.
> *
> - * If this is the root of the dentry tree, return NULL.
> + * If this is the root of the dentry tree, return NULL. If greedy is zero, we
> + * put the parent of this dentry on the kill list instead. The callers must
> + * make sure that __d_kill_final() is called on all dentries on the kill list.
> */
> -static struct dentry *d_kill(struct dentry *dentry)
> +static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
> + int greedy)

If you're keeping 'greedy' then perhaps make it a bool instead of 'int';
that way you don't have to pass an unclear '0' or '1' in the rest of the
code.

> +void __dput(struct dentry *, struct list_head *, int);

Can you move the __dput() code here and avoid the forward function
declaration?

Can __dput() be made static, or do you need to call it from elsewhere? I
didn't see an extern for it in this patch. If there's an extern in another
patch, then it should be moved here.

> +static void __d_kill_final(struct dentry *dentry, struct list_head *list)
> +{

Your patch header says that the caller of __dput or __d_kill must call
__d_kill_final. So shouldn't this be a non-static extern'ed function?

Either way, I suggest documenting in a comment above __d_kill_final() who
should call it and under what circumstances.


> + iput(inode);
> + }
> +
> + if (IS_ROOT(dentry))
> + parent = NULL;
> + else
> + parent = dentry->d_parent;
> + d_free(dentry);
> + __dput(parent, list, 1);
> +}
> +
> +/**
> + * d_kill - kill dentry and return parent
> + * @dentry: dentry to kill
> + *
> + * The dentry must already be unhashed and removed from the LRU.
> + *
> + * If this is the root of the dentry tree, return NULL.
> + */
> +static struct dentry *d_kill(struct dentry *dentry)
> +{
> + LIST_HEAD(mortuary);
> + struct dentry *parent;
> +
> + parent = __d_kill(dentry, &mortuary, 1);
> + while (!list_empty(&mortuary)) {
> + dentry = list_entry(mortuary.next, struct dentry, d_lru);
> + list_del(&dentry->d_lru);
> + __d_kill_final(dentry, &mortuary);
> + }
> +
> + return parent;
> +}
> +
> /*
> * This is dput
> *
> @@ -199,19 +266,24 @@ static struct dentry *d_kill(struct dentry *dentry)
> * Real recursion would eat up our stack space.
> */
>
> -/*
> - * dput - release a dentry
> - * @dentry: dentry to release
> +/**
> + * __dput - release a dentry
> + * @dentry: dentry to release
> + * @list: kill list argument for __d_kill()
> + * @greedy: greedy argument for __d_kill()
> *
> * Release a dentry. This will drop the usage count and if appropriate
> * call the dentry unlink method as well as removing it from the queues and
> * releasing its resources. If the parent dentries were scheduled for release
> - * they too may now get deleted.
> + * they too may now get deleted if @greedy is not zero. Otherwise parent is
> + * added to the kill list. The callers must make sure that __d_kill_final() is
> + * called on all dentries on the kill list.
> + *
> + * You probably want to use dput() instead.
> *
> * no dcache lock, please.
> */
> -
> -void dput(struct dentry *dentry)
> +void __dput(struct dentry *dentry, struct list_head *list, int greedy)
> {

I wonder now if the "__" prefix in __dput is appropriate: usually it's
reserved for "hidden" internal functions that are not supposed to be called
by other users, right? I try to avoid naming things FOO and __FOO because
the name alone doesn't help me understand what each one might be doing. So
maybe rename __dput() to something more descriptive?

Erez.
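
[Editor's sketch] The kill-list idea in this patch, as described in its header, is to queue an object whose last reference was dropped and let a loop free it later, so that releasing a parent never recurses. A minimal userspace model of that pattern follows. struct obj, obj_put_deferred() and kill_list_drain() are hypothetical names, and the model ignores locking, the LRU and the greedy flag.

```c
#include <stdlib.h>

struct obj {
	int refs;
	struct obj *parent;	/* each child holds one reference on its parent */
	struct obj *kill_next;	/* link on the kill list */
};

static int freed_count;		/* demonstration counter only */

static struct obj *obj_new(struct obj *parent)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return NULL;
	o->refs = 1;
	o->parent = parent;
	if (parent)
		parent->refs++;
	return o;
}

/* Drop one reference; on the last one, queue the object, do not free it. */
static void obj_put_deferred(struct obj *o, struct obj **kill_list)
{
	if (o && --o->refs == 0) {
		o->kill_next = *kill_list;
		*kill_list = o;
	}
}

/*
 * Free everything on the kill list. Freeing an object drops its reference
 * on the parent, which may queue the parent, so keep looping until empty.
 */
static void kill_list_drain(struct obj **kill_list)
{
	while (*kill_list) {
		struct obj *o = *kill_list;
		struct obj *parent = o->parent;

		*kill_list = o->kill_next;
		free(o);
		freed_count++;
		obj_put_deferred(parent, kill_list);
	}
}
```

The drain loop has the same shape as the while (!list_empty(&mortuary)) loop in the quoted d_kill(). The caller's obligation to drain the list is exactly the extra requirement the review asks to have documented.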

2009-11-30 02:33:35

by Erez Zadok

Subject: Re: [PATCH 07/41] VFS: Add read-only users count to superblock

In message <[email protected]>, Valerie Aurora writes:
> While we can check if a file system is currently read-only, we can't
> guarantee that it will stay read-only. The file system can be
> remounted read-write at any time; it's also conceivable that a file
> system can be mounted a second time and converted to read-write if the
> underlying fs allows it. This is a problem for union mounts, which
> require the underlying file system be read-only. Add a read-only
> users count and don't allow remounts to change the file system to
> read-write or read-write mounts if there are any read-only users.
>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/super.c | 14 ++++++++++++++
> include/linux/fs.h | 5 +++++
> 2 files changed, 19 insertions(+), 0 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 2761d3e..c8140ac 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -553,6 +553,15 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
> }
> remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
>
> + /* If we are remounting read/write, make sure that none of the
> + users require read-only for correct operation (such as
> + union mounts). */

Minor nit: but I think multi-line comments look better like this:

/*
* text
*/

> + if (remount_rw && sb->s_readonly_users) {
> + printk(KERN_INFO "%s: In use by %d read-only user(s)\n",
> + sb->s_id, sb->s_readonly_users);
> + return -EROFS;
> + }
> +
> if (sb->s_op->remount_fs) {
> retval = sb->s_op->remount_fs(sb, &flags, data);
> if (retval)
> @@ -889,6 +898,11 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
> if (error)
> goto out_sb;
>
> + error = -EROFS;
> + if (!(flags & MS_RDONLY) &&
> + (mnt->mnt_sb->s_readonly_users))

Minor nit: two parts of '&&' in the above 'if' can go on same line and not
violate checkpatch.

> + goto out_sb;
> +
> mnt->mnt_mountpoint = mnt->mnt_root;
> mnt->mnt_parent = mnt;
> up_write(&mnt->mnt_sb->s_umount);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 73e9b64..5fb7343 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1379,6 +1379,11 @@ struct super_block {
> * generic_show_options()
> */
> char *s_options;
> +
> + /*
> + * Users who require read-only access - e.g., union mounts
> + */

Minor nit: for short one-line comments I prefer to save LoC:

/* text */

> + int s_readonly_users;
> };
>
> extern struct timespec current_fs_time(struct super_block *sb);
> --
> 1.6.3.3
>

Erez.
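
[Editor's sketch] The rule the patch adds to do_remount_sb() can be modelled in a few lines: a read-only to read-write transition is refused with -EROFS while any read-only user is registered. In the sketch below, struct sketch_sb and SK_RDONLY are stand-ins for struct super_block and MS_RDONLY. The pin and unpin helpers are hypothetical, since the quoted patch only shows the counter and the checks, not who increments it.

```c
#include <errno.h>

#define SK_RDONLY 0x1		/* stand-in for MS_RDONLY */

struct sketch_sb {
	unsigned long flags;
	int readonly_users;	/* mirrors s_readonly_users */
};

/* Model of the remount check: refuse ro -> rw while anyone needs ro. */
static int sketch_remount(struct sketch_sb *sb, unsigned long new_flags)
{
	int remount_rw = !(new_flags & SK_RDONLY) && (sb->flags & SK_RDONLY);

	if (remount_rw && sb->readonly_users)
		return -EROFS;
	sb->flags = new_flags;
	return 0;
}

/*
 * Hypothetical helper: a union mount registering its lower layer.
 * The errno for a writable lower layer is an assumption of this sketch.
 */
static int sketch_pin_readonly(struct sketch_sb *sb)
{
	if (!(sb->flags & SK_RDONLY))
		return -EROFS;
	sb->readonly_users++;
	return 0;
}

/* Hypothetical helper: drop the registration again on unmount. */
static void sketch_unpin_readonly(struct sketch_sb *sb)
{
	sb->readonly_users--;
}
```

With one user pinned, sketch_remount(&sb, 0) fails and leaves the flags alone. After the unpin the same call succeeds.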

2009-11-30 02:44:35

by Erez Zadok

Subject: Re: [PATCH 08/41] Don't replace nameidata path when following links

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> For autofs4 the commit 051d381259eb57d6074d02a6ba6e90e744f1a29f introduced
> some code that is replacing the path embedded in the nameidata with the
> path of the link itself. This was done to have access to the struct
> vfsmount in the autofs4_follow_link function. Instead autofs4 should
> remember the struct vfsmount when it is mounted.

This is an autofs4 patch, mainly: say so in the subject line:

VFS/Autofs4: don't replace nameidata ...

I'm curious why Ian Kent wasn't CC'ed on this patch originally. I added him
to the CC list now.

And what does this patch have to do with union mounts? Can you document why
you needed this change made?

Lastly, if this patch is acceptable to all parties, then it should be pushed
to the autofs4 maintainers and hopefully upstream well before UM.

> ---
> fs/autofs4/autofs_i.h | 1 +
> fs/autofs4/init.c | 11 ++++++++++-
> fs/autofs4/root.c | 6 ++++++
> fs/namei.c | 7 ++-----
> 4 files changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
> index 8f7cdde..db2bfce 100644
> --- a/fs/autofs4/autofs_i.h
> +++ b/fs/autofs4/autofs_i.h
> @@ -130,6 +130,7 @@ struct autofs_sb_info {
> int reghost_enabled;
> int needs_reghost;
> struct super_block *sb;
> + struct vfsmount *mnt;
> struct mutex wq_mutex;
> spinlock_t fs_lock;
> struct autofs_wait_queue *queues; /* Wait queue pointer */
> diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
> index 9722e4b..5e0dcd7 100644
> --- a/fs/autofs4/init.c
> +++ b/fs/autofs4/init.c
> @@ -17,7 +17,16 @@
> static int autofs_get_sb(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data, struct vfsmount *mnt)
> {
> - return get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
> + struct autofs_sb_info *sbi;
> + int ret;
> +
> + ret = get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
> + if (ret)
> + return ret;
> +
> + sbi = autofs4_sbi(mnt->mnt_sb);
> + sbi->mnt = mnt;
> + return 0;
> }
>
> static struct file_system_type autofs_fs_type = {
> diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
> index b96a3c5..cb991b8 100644
> --- a/fs/autofs4/root.c
> +++ b/fs/autofs4/root.c
> @@ -179,6 +179,12 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
> DPRINTK("dentry=%p %.*s oz_mode=%d nd->flags=%d",
> dentry, dentry->d_name.len, dentry->d_name.name, oz_mode,
> nd->flags);
> +
> + dput(nd->path.dentry);
> + mntput(nd->path.mnt);
> + nd->path.mnt = mntget(sbi->mnt);
> + nd->path.dentry = dget(dentry);
> +
> /*
> * For an expire of a covered direct or offset mount we need
> * to break out of follow_down() at the autofs mount trigger
> diff --git a/fs/namei.c b/fs/namei.c
> index a338496..46cf1cb 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -636,11 +636,8 @@ static __always_inline int __do_follow_link(struct path *path, struct nameidata
> touch_atime(path->mnt, dentry);
> nd_set_link(nd, NULL);
>
> - if (path->mnt != nd->path.mnt) {
> - path_to_nameidata(path, nd);
> - dget(dentry);
> - }
> - mntget(path->mnt);
> + if (path->mnt == nd->path.mnt)
> + mntget(nd->path.mnt);
> cookie = dentry->d_inode->i_op->follow_link(dentry, nd);
> error = PTR_ERR(cookie);
> if (!IS_ERR(cookie)) {

Just want to mention that the five lines you replace with the two lines, in
the above patch snippet, are not functionally equivalent. Is this the
intention of "reversing" what commit
051d381259eb57d6074d02a6ba6e90e744f1a29f introduced? If not, then please
explain the change in __do_follow_link.

Thanks,
Erez.

2009-11-30 02:54:15

by Erez Zadok

Subject: Re: [PATCH 09/41] whiteout: Don't return information about whiteouts to userspace

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> The userspace isn't ready for handling another filetype. Therefore this
> patch lets readdir() and others skip over the whiteout directory entries
> they might find.

The NFSD maintainers and MLs should be CC'ed on such patches which touch
fs/nfsd/. I'd also suggest you change the subject line of this patch to:

whiteout/NFSD: don't return ...

This patch seems fairly straightforward: it returns 0 when d_type is DT_WHT.
As long as there's no way to create such whiteout entries (not until UM is
used), then there's no harm in pushing such patches upstream, no?
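
[Editor's sketch] The pattern in this patch is a directory-fill callback that reports success for a whiteout without copying anything out, so the iteration continues and userspace never sees the entry. A userspace model of that pattern is sketched below. The sk_* names and the fixed-size buffer are hypothetical, and the callback signature is simplified compared with the kernel's filldir_t.

```c
#include <stddef.h>

#define SK_DT_REG 8	/* same numeric value as DT_REG */
#define SK_DT_WHT 14	/* same numeric value as DT_WHT */

struct sk_entry {
	const char *name;
	unsigned type;
};

struct sk_buf {
	const char *names[8];
	size_t count;
};

/* Fill callback: return 0 to keep iterating, nonzero to stop. */
static int sk_filldir(void *ctx, const char *name, unsigned d_type)
{
	struct sk_buf *buf = ctx;

	if (d_type == SK_DT_WHT)
		return 0;	/* skip silently, but do not end the walk */
	if (buf->count == 8)
		return -1;	/* buffer full */
	buf->names[buf->count++] = name;
	return 0;
}

/* Walk a directory image, handing each entry to the fill callback. */
static int sk_readdir(const struct sk_entry *dir, size_t n, void *ctx)
{
	for (size_t i = 0; i < n; i++) {
		int err = sk_filldir(ctx, dir[i].name, dir[i].type);

		if (err)
			return err;
	}
	return 0;
}
```

Returning 0 matters here: a nonzero return would end the walk at the first whiteout and hide every entry after it.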

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/compat.c | 9 +++++++++
> fs/nfsd/nfs3xdr.c | 5 +++++
> fs/nfsd/nfs4xdr.c | 2 +-
> fs/nfsd/nfsxdr.c | 4 ++++
> fs/readdir.c | 9 +++++++++
> 5 files changed, 28 insertions(+), 1 deletions(-)
>
> diff --git a/fs/compat.c b/fs/compat.c
> index 6d6f98f..43f6102 100644
> --- a/fs/compat.c
> +++ b/fs/compat.c
> @@ -847,6 +847,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
> struct compat_old_linux_dirent __user *dirent;
> compat_ulong_t d_ino;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> if (buf->result)
> return -EINVAL;
> d_ino = ino;
> @@ -918,6 +921,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
> compat_ulong_t d_ino;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> @@ -1007,6 +1013,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
> int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
> u64 off;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 01d4ec1..59576d0 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -884,6 +884,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
> int elen; /* estimated entry length in words */
> int num_entry_words = 0; /* actual number of words */
>
> + if (d_type == DT_WHT) {
> + cd->common.err = nfs_ok;
> + return 0;
> + }
> +
> if (cd->offset) {
> u64 offset64 = offset;
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 2dcc7fe..8c25012 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2263,7 +2263,7 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
> __be32 nfserr = nfserr_toosmall;
>
> /* In nfsv4, "." and ".." never make it onto the wire.. */
> - if (name && isdotent(name, namlen)) {
> + if (d_type == DT_WHT || (name && isdotent(name, namlen))) {

Optimization: I would swap the order of the two conditions separated by the
'||': the right-hand-side condition is far more likely to occur than
d_type==DT_WHT, so you can enter the body of the 'if' more quickly for the
common case.
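
Roughly like this (a userspace sketch; isdotent_model() is my stand-in for
the nfsd helper, whose real definition is not shown in this patch, and
MODEL_DT_WHT mirrors the usual DT_WHT value):

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DT_WHT	14	/* mirrors DT_WHT from <dirent.h> */

/* Stand-in for isdotent(): true for "." and "..". */
static bool isdotent_model(const char *name, int namlen)
{
	return (namlen == 1 && name[0] == '.') ||
	       (namlen == 2 && name[0] == '.' && name[1] == '.');
}

/*
 * The likelier test comes first; thanks to short-circuit evaluation the
 * d_type comparison only runs when the entry is not "." or "..".
 */
static bool skip_entry(const char *name, int namlen, unsigned int d_type)
{
	return (name && isdotent_model(name, namlen)) ||
	       d_type == MODEL_DT_WHT;
}
```

The result is the same either way; only the evaluation order changes.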

> cd->common.err = nfs_ok;
> return 0;
> }
> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> index afd08e2..a7d622c 100644
> --- a/fs/nfsd/nfsxdr.c
> +++ b/fs/nfsd/nfsxdr.c
> @@ -513,6 +513,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
> namlen, name, offset, ino);
> */
>
> + if (d_type == DT_WHT) {
> + cd->common.err = nfs_ok;
> + return 0;
> + }
> if (offset > ~((u32) 0)) {
> cd->common.err = nfserr_fbig;
> return -EINVAL;
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 7723401..3a48491 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> struct old_linux_dirent __user * dirent;
> unsigned long d_ino;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> if (buf->result)
> return -EINVAL;
> d_ino = ino;
> @@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
> unsigned long d_ino;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> @@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
> struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> --
> 1.6.3.3
>

Erez.

2009-11-30 03:04:49

by Erez Zadok

Subject: Re: [PATCH 10/41] whiteout: Add vfs_whiteout() and whiteout inode operation

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Simply white-out a given directory entry. This functionality is usually used
> in the sense of unlink. Therefore the given dentry can still be in-use and
> contains an in-use inode. The filesystems inode operation has to do what
> unlink or rmdir would in that case. Since the dentry still might be in-use
> we have to provide a fresh unhashed dentry that is used as the whiteout
> dentry instead. The given dentry is dropped and the whiteout dentry is
> rehashed instead.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 4 +-
> fs/namei.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dcache.h | 6 +++
> include/linux/fs.h | 3 +
> 4 files changed, 116 insertions(+), 1 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3415e9e..0fcae4b 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1076,8 +1076,10 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
> /* the caller must hold dcache_lock */
> static void __d_instantiate(struct dentry *dentry, struct inode *inode)
> {
> - if (inode)
> + if (inode) {
> + dentry->d_flags &= ~DCACHE_WHITEOUT;
> list_add(&dentry->d_alias, &inode->i_dentry);
> + }
> dentry->d_inode = inode;
> fsnotify_d_instantiate(dentry, inode);
> }
> diff --git a/fs/namei.c b/fs/namei.c
> index 46cf1cb..d2fc8c9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2169,6 +2169,110 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
> return sys_mkdirat(AT_FDCWD, pathname, mode);
> }
>
> +
> +/* Checks on the victim for whiteout */
> +static inline int may_whiteout(struct inode *dir, struct dentry *victim,
> + int isdir)

Why not make 'isdir' a boolean?

I'd prefer to see more documentation above this function: explain what each
arg does, return value, etc.
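
Something along these lines is what I have in mind. This is only a
userspace sketch with made-up model_* types; it covers just the
directory/non-directory part of the checks and leaves out the permission,
sticky, append and immutable tests that the real function performs:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures (illustration only). */
struct model_inode {
	bool is_dir;
};

struct model_dentry {
	struct model_inode *d_inode;	/* NULL for a negative dentry */
};

/**
 * model_may_whiteout - check whether @victim may be replaced by a whiteout
 * @victim: dentry that is about to be whited out
 * @isdir:  true if the caller acts like rmdir, false if it acts like unlink
 *
 * Return: 0 if the whiteout is allowed (a negative dentry is always
 * allowed), -ENOTDIR if @isdir is set but @victim is not a directory,
 * -EISDIR if @isdir is clear but @victim is a directory.
 */
static int model_may_whiteout(const struct model_dentry *victim, bool isdir)
{
	if (!victim->d_inode)
		return 0;
	if (isdir && !victim->d_inode->is_dir)
		return -ENOTDIR;
	if (!isdir && victim->d_inode->is_dir)
		return -EISDIR;
	return 0;
}
```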

> +{
> + int err;
> +
> + /* from may_create() */
> + if (IS_DEADDIR(dir))
> + return -ENOENT;
> + err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
> + if (err)
> + return err;
> +
> + /* from may_delete() */
> + if (IS_APPEND(dir))
> + return -EPERM;
> + if (!victim->d_inode)
> + return 0;
> + if (check_sticky(dir, victim->d_inode) ||
> + IS_APPEND(victim->d_inode) ||
> + IS_IMMUTABLE(victim->d_inode))
> + return -EPERM;
> + if (isdir) {
> + if (!S_ISDIR(victim->d_inode->i_mode))
> + return -ENOTDIR;
> + if (IS_ROOT(victim))
> + return -EBUSY;
> + } else if (S_ISDIR(victim->d_inode->i_mode))
> + return -EISDIR;
> + if (victim->d_flags & DCACHE_NFSFS_RENAMED)
> + return -EBUSY;
> + return 0;
> +}
> +
> +/**
> + * vfs_whiteout: creates a white-out for the given directory entry
> + * @dir: parent inode
> + * @dentry: directory entry to white-out

Nit: is it 'white-out' or 'whiteout'? Whatever you choose is fine, but
please use consistent hyphenation/spelling everywhere (code, comments, and
documentation).

> + *
> + * Simply white-out a given directory entry. This functionality is usually used
> + * in the sense of unlink. Therefore the given dentry can still be in-use and
> + * contains an in-use inode. The filesystem has to do what unlink or rmdir

Nit: other than the line of comment just above, the other two instances of
"in-use" in this comment (and the patch header) should be changed to "in
use" (no hyphen).

> + * would in that case. Since the dentry still might be in-use we have to
> + * provide a fresh unhashed dentry that whiteout can fill the new inode into.
> + * In that case the given dentry is dropped and the fresh dentry containing the
> + * whiteout is rehashed instead. If the given dentry is unused, the whiteout
> + * inode is instantiated into it instead.
> + *
> + * After this returns with success, don't make any assumptions about the inode.

What kinds of assumptions should one not make? Perhaps it'd be better to
document what you can/should assume, instead of what you shouldn't (or
both?)

> + * Just dput() it dentry.

The last line is awkward: do you mean "its dentry"?

> + */
> +int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> +{
> + int err;
> + struct inode *old_inode = dentry->d_inode;
> + struct dentry *parent, *whiteout;
> +
> + err = may_whiteout(dir, dentry, isdir);
> + if (err)
> + return err;
> +
> + BUG_ON(dentry->d_parent->d_inode != dir);
> +
> + if (!dir->i_op || !dir->i_op->whiteout)
> + return -EOPNOTSUPP;
> +
> + if (old_inode) {
> + vfs_dq_init(dir);
> +
> + mutex_lock(&old_inode->i_mutex);
> + if (isdir)
> + dentry_unhash(dentry);
> + if (d_mountpoint(dentry))
> + err = -EBUSY;
> + else {
> + if (isdir)
> + err = security_inode_rmdir(dir, dentry);
> + else
> + err = security_inode_unlink(dir, dentry);
> + }
> + }
> +
> + parent = dget_parent(dentry);
> + whiteout = d_alloc_name(parent, dentry->d_name.name);
> +
> + if (!err)
> + err = dir->i_op->whiteout(dir, dentry, whiteout);
> +
> + if (old_inode) {
> + mutex_unlock(&old_inode->i_mutex);
> + if (!err) {
> + fsnotify_link_count(old_inode);
> + d_delete(dentry);
> + }
> + if (isdir)
> + dput(dentry);
> + }
> +
> + dput(whiteout);
> + dput(parent);
> + return err;
> +}
> +
> /*
> * We try to drop the dentry early: we should have
> * a usage count of 2 if we're the only user of this
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 30b93b2..7648b49 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -183,6 +183,7 @@ d_iput: no no no yes
> #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched by inotify */
>
> #define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
> +#define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */
>
> #define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */
>
> @@ -358,6 +359,11 @@ static inline int d_unlinked(struct dentry *dentry)
> return d_unhashed(dentry) && !IS_ROOT(dentry);
> }
>
> +static inline int d_is_whiteout(struct dentry *dentry)
> +{
> + return (dentry->d_flags & DCACHE_WHITEOUT);
> +}
> +
> static inline struct dentry *dget_parent(struct dentry *dentry)
> {
> struct dentry *ret;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5fb7343..04a9870 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -205,6 +205,7 @@ struct inodes_stat_t {
> #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
> #define MS_I_VERSION (1<<23) /* Update inode I_version field */
> #define MS_STRICTATIME (1<<24) /* Always perform atime updates */
> +#define MS_WHITEOUT (1<<26) /* fs does support white-out filetype */
> #define MS_ACTIVE (1<<30)
> #define MS_NOUSER (1<<31)
>
> @@ -1422,6 +1423,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> extern int vfs_rmdir(struct inode *, struct dentry *);
> extern int vfs_unlink(struct inode *, struct dentry *);
> extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int vfs_whiteout(struct inode *, struct dentry *, int);
>
> /*
> * VFS dentry helper functions.
> @@ -1526,6 +1528,7 @@ struct inode_operations {
> int (*mkdir) (struct inode *,struct dentry *,int);
> int (*rmdir) (struct inode *,struct dentry *);
> int (*mknod) (struct inode *,struct dentry *,int,dev_t);
> + int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
> int (*rename) (struct inode *, struct dentry *,
> struct inode *, struct dentry *);
> int (*readlink) (struct dentry *, char __user *,int);

Nit: I'm curious why you decided to add the function proto for (*whiteout)
where you did? inode_operations isn't sorted alphabetically, so why not
just append it to the end of the op list?

> 1.6.3.3
>

Erez.

2009-11-30 06:04:47

by Erez Zadok

Subject: Re: [PATCH 03/41] VFS: Make lookup_hash() return a struct path

In message <[email protected]>, Valerie Aurora writes:

> @@ -1937,7 +1942,8 @@ EXPORT_SYMBOL(filp_open);
> */
> struct dentry *lookup_create(struct nameidata *nd, int is_dir)
> {
> - struct dentry *dentry = ERR_PTR(-EEXIST);
> + struct path path = { .dentry = ERR_PTR(-EEXIST) } ;

I assume the compiler will initialize path.mnt to NULL. Is NULL what you
want? Even if the compiler guarantees it, I think you should either
explicitly init .mnt to NULL or leave a comment explaining what's going on
-- so no future code reader will think that this was omitted by mistake; a
comment can clarify your intentions more explicitly.

A struct path often requires both .mnt and .dentry to be set; it's not like,
say, inode_operations, where clearly some fields can be initialized to NULL
just fine.
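
For reference, a small userspace sketch of the two spellings (the model_*
types are made up; only the initializer behavior matters here). Under the
C99 initialization rules, members left out of a designated initializer are
zeroed, so both functions produce the same object; the second one just
makes the intent visible:

```c
#include <stddef.h>

/* Stand-ins for struct vfsmount, struct dentry and struct path. */
struct model_vfsmount { int id; };
struct model_dentry { int id; };

struct model_path {
	struct model_vfsmount *mnt;
	struct model_dentry *dentry;
};

/* Implicit: .mnt is not mentioned, so the compiler initializes it to NULL. */
static struct model_path implicit_init(struct model_dentry *d)
{
	struct model_path path = { .dentry = d };

	return path;
}

/* Explicit: same result, but the reader can see that NULL is intended. */
static struct model_path explicit_init(struct model_dentry *d)
{
	struct model_path path = {
		.mnt = NULL,	/* deliberately unset at this point */
		.dentry = d,
	};

	return path;
}
```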

Erez.

2009-11-30 06:14:07

by Erez Zadok

Subject: Re: [PATCH 12/41] union-mount: Allow removal of a directory

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> do_whiteout() allows removal of a directory when it has whiteouts but
> is logically empty.
>
> XXX - This patch abuses readdir() to check if the union directory is
> logically empty - that is, all the entries are whiteouts (or "." or
> ".."). Currently, we have no clean VFS interface to ask the lower
> file system if a directory is empty.
>
> Fixes:
> - Add ->is_directory_empty() op
> - Add is_directory_empty flag to dentry (ugly dcache populate)
> - Ask underlying fs to remove it and look for an error return
> - (your idea here)

Yeah, this is a difficult issue. I think the best way would be to

1. add an OPTIONAL ->is_directory_empty() inode op.

2. have the VFS use some default/generic behavior ala filldir_is_empty()
below if inode->i_op->is_directory_empty is NULL. I assume this behavior
will only need to be checked for file systems that support whiteouts in
the first place.

This'll provide some working behavior for all whiteout-supporting file
systems, but allow anyone who wants to develop a more efficient method to
provide one.
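
A rough userspace sketch of the dispatch I am describing. Everything here
is hypothetical: the model_* types, the "fastfs" example file system, and
the nr_real_entries counter, which merely stands in for whatever the
readdir-based walk would compute in the kernel:

```c
#include <stdbool.h>
#include <stddef.h>

struct model_inode;

struct model_inode_ops {
	/* Optional: file systems may leave this NULL. */
	bool (*is_directory_empty)(struct model_inode *dir);
};

struct model_inode {
	const struct model_inode_ops *i_op;
	int nr_real_entries;	/* entries other than whiteouts, "." and ".." */
};

/* Generic fallback; in the kernel this would be the readdir-based check. */
static bool generic_is_directory_empty(struct model_inode *dir)
{
	return dir->nr_real_entries == 0;
}

/* Prefer the file system's own method, fall back to the generic one. */
static bool vfs_is_directory_empty(struct model_inode *dir)
{
	if (dir->i_op && dir->i_op->is_directory_empty)
		return dir->i_op->is_directory_empty(dir);
	return generic_is_directory_empty(dir);
}

/* Hypothetical file system with a cheaper answer of its own. */
static bool fastfs_is_directory_empty(struct model_inode *dir)
{
	(void)dir;
	return true;
}

static const struct model_inode_ops fastfs_ops = {
	.is_directory_empty = fastfs_is_directory_empty,
};

/* A file system that does not implement the optional op. */
static const struct model_inode_ops plain_ops = {
	.is_directory_empty = NULL,
};
```

With this shape, file systems that do nothing still get correct (if slow)
behavior, and nobody is forced to implement the op on day one.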

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 85 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 5da1635..9a62c75 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2284,6 +2284,91 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> }
>
> /*
> + * This is abusing readdir to check if a union directory is logically empty.
> + * Al Viro barfed when he saw this, but Val said: "Well, at this point I'm
> + * aiming for working, pretty can come later"
> + */
> +static int filldir_is_empty(void *__buf, const char *name, int namlen,
> + loff_t offset, u64 ino, unsigned int d_type)
> +{

Why not make filldir_is_empty() return a bool? That explains more clearly
the function's return code.

> +static int directory_is_empty(struct dentry *dentry, struct vfsmount *mnt)
> +{

This can also return a bool.

> +static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
> +{

'isdir' can be bool.

> + struct path safe = { .dentry = dget(nd->path.dentry),
> + .mnt = mntget(nd->path.mnt) };
> + struct dentry *dentry = path->dentry;
> + int err;

You might want to move the initialization of 'struct path safe' down below,
and add a BUG_ON(!nd) before that. I think during the development phases of
UM, it's a good idea to have a few more debugging BUG_ON's.

Erez.

2009-11-30 06:27:21

by Erez Zadok

Subject: Re: [PATCH 13/41] whiteout: tmpfs whiteout support

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Add support for whiteout dentries to tmpfs.

Shouldn't you CC Hugh Dickins here? He's probably best positioned to review
the changes in mm/shmem.c.

> XXX - Not sure this is the right patch to put the code for supporting
> whiteouts in d_genocide().
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 3 +-
> mm/shmem.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 137 insertions(+), 15 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 0fcae4b..1fae1df 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2280,7 +2280,8 @@ resume:
> struct list_head *tmp = next;
> struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> next = tmp->next;
> - if (d_unhashed(dentry)||!dentry->d_inode)
> + if (d_unhashed(dentry)||(!dentry->d_inode &&
> + !d_is_whiteout(dentry)))

I think this d_genocide patch should go elsewhere. What does it have to do
with tmpfs?

Also, is your logic above correct? If I understood d_genocide correctly,
then the code you changed attempts to skip over dentries for which
d_genocide has no work to do, like unhashed and negative dentries. So I
assume it should also skip over whiteout dentries. Your condition is

if (d_unhashed(dentry) || (!dentry->d_inode && !d_is_whiteout(dentry)))

but perhaps it needs to be

if (d_unhashed(dentry) || !dentry->d_inode || d_is_whiteout(dentry))

No?

Either way, you may want to document any complex conditional that may be
confusing to parse.
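
To spell out where the two conditions differ, here is a small userspace
truth-table sketch (the function names are mine; the three booleans stand
for d_unhashed(dentry), dentry->d_inode != NULL and d_is_whiteout(dentry)):

```c
#include <stdbool.h>

/* Condition as posted: skip unhashed dentries and non-whiteout negatives. */
static bool skip_as_posted(bool unhashed, bool has_inode, bool is_whiteout)
{
	return unhashed || (!has_inode && !is_whiteout);
}

/* Alternative: skip unhashed, negative, and whiteout dentries alike. */
static bool skip_alternative(bool unhashed, bool has_inode, bool is_whiteout)
{
	return unhashed || !has_inode || is_whiteout;
}
```

Assuming a positive dentry never carries DCACHE_WHITEOUT (the
__d_instantiate() change clears it), the only hashed case where the two
disagree is a negative dentry flagged as a whiteout: the posted condition
lets d_genocide process it, the alternative skips it.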

> continue;
> if (!list_empty(&dentry->d_subdirs)) {
> this_parent = dentry;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d713239..2faa14b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
[mm/shmem.c changes snipped]

Erez.

2009-11-30 06:32:34

by Erez Zadok

Subject: Re: [PATCH 14/41] whiteout: Split of ext2_append_link() from ext2_add_link()

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> The ext2_append_link() is later used to find or append a directory
> entry to whiteout.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Theodore Tso <[email protected]>
> Cc: [email protected]
> ---
> fs/ext2/dir.c | 70 ++++++++++++++++++++++++++++++++++++++++----------------
> 1 files changed, 50 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index 6cde970..cb8ceff 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
> }
>
> /*
> - * Parent is locked.
> + * Find or append a given dentry to the parent directory
> */
> -int ext2_add_link (struct dentry *dentry, struct inode *inode)
> +static ext2_dirent * ext2_append_entry(struct dentry * dentry,
> + struct page ** page)

I thought checkpatch didn't want to see spaces after a '*', so
"struct foo * ptr" should become "struct foo *ptr".

I also think that "struct page **page" should be renamed to "struct page
**ppage" or "struct page **pages", to avoid confusion with many other
functions which pass a "struct page *" pointer to a variable named "page".

Erez.

2009-11-30 07:46:14

by Erez Zadok

Subject: Re: [PATCH 15/41] whiteout: ext2 whiteout support

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> This patch adds whiteout support to EXT2. A whiteout is an empty directory
> entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
> allocates space in directories. Due to being implemented as a filetype it is
> necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set.
>
> XXX - Whiteouts could be implemented as special symbolic links
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Theodore Tso <[email protected]>
> Cc: [email protected]
> ---
> fs/ext2/dir.c | 96 +++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext2/ext2.h | 3 +
> fs/ext2/inode.c | 11 ++++-
> fs/ext2/namei.c | 65 ++++++++++++++++++++++++++++++-
> fs/ext2/super.c | 7 +++
> include/linux/ext2_fs.h | 4 ++
> 6 files changed, 176 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index cb8ceff..d4628c0 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name,
> {
> if (len != de->name_len)
> return 0;
> - if (!de->inode)
> + if (!de->inode && (de->file_type != EXT2_FT_WHT))

The extra parens around (de->file_type != EXT2_FT_WHT) don't hurt but are
unnecessary. Ditto in a couple of other places in this patch.

Erez.

2009-11-30 07:51:32

by Erez Zadok

Subject: Re: [PATCH 16/41] whiteout: jffs2 whiteout support

In message <[email protected]>, Valerie Aurora writes:
> From: Felix Fietkau <[email protected]>
>
> Add support for whiteout dentries to jffs2.
>
> Signed-off-by: Felix Fietkau <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: David Woodhouse <[email protected]>
> Cc: [email protected]
> ---
> fs/jffs2/dir.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++-
> fs/jffs2/fs.c | 4 ++
> fs/jffs2/super.c | 2 +-
> include/linux/jffs2.h | 2 +
> 4 files changed, 82 insertions(+), 3 deletions(-)
>
> diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> index 6f60cc9..46a2e1b 100644
> --- a/fs/jffs2/dir.c
> +++ b/fs/jffs2/dir.c
> @@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
> static int jffs2_rename (struct inode *, struct dentry *,
> struct inode *, struct dentry *);
>
> +static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
> +
> const struct file_operations jffs2_dir_operations =
> {
> .read = generic_read_dir,
> @@ -55,6 +57,7 @@ const struct inode_operations jffs2_dir_inode_operations =
> .rmdir = jffs2_rmdir,
> .mknod = jffs2_mknod,
> .rename = jffs2_rename,
> + .whiteout = jffs2_whiteout,
> .permission = jffs2_permission,
> .setattr = jffs2_setattr,
> .setxattr = jffs2_setxattr,
> @@ -98,8 +101,18 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
> fd = fd_list;
> }
> }
> - if (fd)
> - ino = fd->ino;
> + if (fd) {
> + spin_lock(&target->d_lock);
> + switch(fd->type) {
> + case DT_WHT:
> + target->d_flags |= DCACHE_WHITEOUT;
> + break;
> + default:
> + ino = fd->ino;
> + break;
> + }
> + spin_unlock(&target->d_lock);
> + }

The switch statement above should be simplified into this:

if (fd->type == DT_WHT)
target->d_flags |= DCACHE_WHITEOUT;
else
ino = fd->ino;

> + /* If it's a directory, then check whether it is really empty
> + */

Format above comment on one line.

Erez.

2009-11-30 08:04:17

by Erez Zadok

Subject: Re: [PATCH 17/41] whiteout: Add path_whiteout() helper

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Add a path_whiteout() helper for vfs_whiteout().
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 15 ++++++++++++++-
> include/linux/fs.h | 1 -
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 9a62c75..408380d 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2231,7 +2231,7 @@ static inline int may_whiteout(struct inode *dir, struct dentry *victim,
> * After this returns with success, don't make any assumptions about the inode.
> * Just dput() it dentry.
> */
> -int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> +static int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)

Didn't some other patch introduce vfs_whiteout? So why have a second patch
which makes vfs_whiteout a static? Why not introduce both vfs_whiteout and
path_whiteout in one patch?

> {
> int err;
> struct inode *old_inode = dentry->d_inode;
> @@ -2283,6 +2283,19 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> return err;
> }
>
> +int path_whiteout(struct path *dir_path, struct dentry *dentry, int isdir)

Please document the behavior of path_whiteout in a proper comment above it
(kernel-doc). Describe return values, side effects, etc.

Also, isdir in both vfs_whiteout and path_whiteout can be boolean.

Erez.

2009-11-30 08:03:16

by Erez Zadok

Subject: Re: [PATCH 19/41] union-mount: Introduce MNT_UNION and MS_UNION flags

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Add per mountpoint flag for Union Mount support. You need additional patches
> to util-linux for that to work - see:
>
> git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namespace.c | 5 ++++-
> include/linux/fs.h | 1 +
> include/linux/mount.h | 1 +
> 3 files changed, 6 insertions(+), 1 deletions(-)
[...]

> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 5d52753..e175c47 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -35,6 +35,7 @@ struct mnt_namespace;
> #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
> #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
> #define MNT_PNODE_MASK 0x3000 /* propagation flag mask */
> +#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */

Is it correct to just add another flag here? How does it relate to this
'propagation mask' right above it? If there's some code out there which
masks out which MNT flags get propagated and which don't, then you need to
make a decision whether MNT_UNION needs to be propagated as well. Either
way, please document your decision in a comment here so no one will have to
ask the same question again.
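
For what it's worth, the arithmetic with the values quoted above can be
checked in userspace. The two helper functions are made up for
illustration; they only show that the new bit lies outside the propagation
mask, not what the code using the mask ought to do with it:

```c
#include <stdbool.h>

/* Values copied from the hunk quoted above. */
#define MNT_SHARED	0x1000
#define MNT_UNBINDABLE	0x2000
#define MNT_PNODE_MASK	0x3000
#define MNT_UNION	0x4000

/* True if any bit of @flag falls inside the propagation mask. */
static bool flag_in_pnode_mask(int flag)
{
	return (flag & MNT_PNODE_MASK) != 0;
}

/* Clear the propagation bits and leave everything else untouched. */
static int strip_pnode_flags(int mnt_flags)
{
	return mnt_flags & ~MNT_PNODE_MASK;
}
```

So MNT_UNION survives any code that masks with MNT_PNODE_MASK; whether
that is the behavior you want is the part that deserves a comment.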

Erez.

2009-11-30 08:47:05

by Erez Zadok

Subject: Re: [PATCH 20/41] union-mount: Introduce union_mount structure

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> This patch adds the basic structures of VFS based union mounts. It is a new
> implementation based on some of my old ideas that influenced Bharata B Rao
> <[email protected]> who came up with the proposal to let the
> union_mount struct only point to the next layer in the union stack. I rewrote
> nearly all of the central patches around lookup and the dcache interaction.
>
> Advantages of the new implementation:
> - the new union stack is no longer tied directly to one dentry
> - the union stack enables dentries to be part of more than one union
> (bind mounts)
> - it is unnecessary to traverse the union stack when de/referencing a dentry
> - caching of union stack information still driven by dentry cache
>
> XXX - is_unionized() is pretty heavy-weight for non-union file systems
> on a union mount-enabled kernel. May be simplified by assuming one or
> more of:
>
> - Two layers only
> - One-to-one association between layers (doesn't union submounts)
> - Writable layer mounted in only one place

Yes, is_unionized() does appear to be heavy. Is it correct to assume that
every such dentry will have gotten looked up or traversed as part of a
union? If so, can we just set a flag in the dentry to mark it as
D_THIS_IS_PART_OF_A_UNION? Even if you could, what happens when a union r-w
layer is removed: could there be leftover dentries marked as part of a
union, which are no longer really part of it?

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/Kconfig | 13 ++
> fs/Makefile | 1 +
> fs/dcache.c | 4 +
> fs/union.c | 332 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dcache.h | 9 ++
> include/linux/union.h | 61 +++++++++
> 6 files changed, 420 insertions(+), 0 deletions(-)
> create mode 100644 fs/union.c
> create mode 100644 include/linux/union.h
[...]

> diff --git a/fs/union.c b/fs/union.c
> new file mode 100644
> index 0000000..d1950c2
> --- /dev/null
> +++ b/fs/union.c
> @@ -0,0 +1,332 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
> + * Copyright (C) 2007-2009 Novell Inc.
> + *
> + * Author(s): Jan Blunck ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + */
> +
> +#include <linux/bootmem.h>
> +#include <linux/init.h>
> +#include <linux/types.h>
> +#include <linux/hash.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/fs_struct.h>
> +#include <linux/union.h>
> +
> +/*
> + * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
> + * should try to make this good - I've just made it work.
> + */
> +static unsigned int union_hash_mask __read_mostly;
> +static unsigned int union_hash_shift __read_mostly;
> +static struct hlist_head *union_hashtable __read_mostly;
> +static unsigned int union_rhash_mask __read_mostly;
> +static unsigned int union_rhash_shift __read_mostly;
> +static struct hlist_head *union_rhashtable __read_mostly;
> +
> +/*
> + * Locking Rules:
> + * - dcache_lock (for union_rlookup() only)
> + * - union_lock
> + */

Locking rules are pretty important to detail some more here, even if it just
repeats what's in union-mounts.txt. How/when are each of these two locks
used here?

> +void union_put(struct union_mount *um)
> +{
> + struct path tmp = um->u_next;

Are you relying on compiler support for C structure copying in the above
assignment? I try to avoid such confusion in general. It seems safer (and
saves you a bit on stack space) to change the code to:

struct path *tmp = &um->u_next;
if (__union_put(um))
path_put(tmp);
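
One caveat with the pointer form, which I have not checked against the
body of __union_put(): if it frees um when the last reference is dropped,
then &um->u_next would dangle and the copy is actually required. A
userspace sketch of that situation (model_union_put() is a made-up
stand-in, and its free-on-last-put behavior is an assumption, not
something this excerpt confirms):

```c
#include <stdbool.h>
#include <stdlib.h>

struct model_path {
	int mnt_id;
	int dentry_id;
};

struct model_union {
	int refcount;
	struct model_path u_next;
};

/*
 * Assumed behavior: drop one reference, free the object on the last one,
 * and return true if it was freed.
 */
static bool model_union_put(struct model_union *um)
{
	if (--um->refcount > 0)
		return false;
	free(um);
	return true;
}

/*
 * Copy u_next before the put: tmp lives on the stack, so it stays valid
 * even if um has been freed by the time we hand it back through @out.
 */
static bool put_and_take_next(struct model_union *um, struct model_path *out)
{
	struct model_path tmp = um->u_next;

	if (!model_union_put(um))
		return false;
	*out = tmp;
	return true;
}
```

If __union_put() never frees um, the pointer form is fine; if it can, a
short comment next to the copy would save the next reader the same
question.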

> + if (__union_put(um))
> + path_put(&tmp);
> +}

> +/*
> + * is_unionized - check if a dentry lives on a union mounted file system
> + *
> + * This tests if a dentry is living on an union mounted file system by walking
> + * the file system hierarchy.
> + */
> +int is_unionized(struct dentry *dentry, struct vfsmount *mnt)

This can be a boolean function.

> +{
> + struct path this = { .mnt = mntget(mnt),
> + .dentry = dget(dentry) };
> + struct vfsmount *tmp;
> +
> + do {
> + /* check if there is an union mounted on top of us */
> + spin_lock(&vfsmount_lock);
> + list_for_each_entry(tmp, &this.mnt->mnt_mounts, mnt_child) {
> + if (!(tmp->mnt_flags & MNT_UNION))
> + continue;
> + /* Isn't this a bug? */

It's customary to prefix such comments with XXX, as it helps those who like
to grep for issues:

/* XXX: isn't this a bug? */

> + if (this.dentry->d_sb != tmp->mnt_mountpoint->d_sb)
> + continue;
> + if (is_subdir(this.dentry, tmp->mnt_mountpoint)) {
> + spin_unlock(&vfsmount_lock);
> + path_put(&this);
> + return 1;
> + }
> + }
> + spin_unlock(&vfsmount_lock);
> +
> + /* check our mountpoint next */
> + tmp = mntget(this.mnt->mnt_parent);
> + dput(this.dentry);
> + this.dentry = dget(this.mnt->mnt_mountpoint);
> + mntput(this.mnt);
> + this.mnt = tmp;
> + } while (this.mnt != this.mnt->mnt_parent);
> +
> + path_put(&this);
> + return 0;
> +}
> +

> +/*
> + * follow_union_down - follow the union stack one layer down
> + *
> + * This is called to traverse the union stack from one layer to the next
> + * overlayed one. follow_union_down() is called by various lookup functions
> + * that are aware of union mounts.
> + *
> + * Returns non-zero if followed to the next layer, zero otherwise.

But, you're returning a 1 or 0 always, so why not make it a bool function?
Or do you think this function could ever return something different
(-ERRNO)?
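
As a sketch of what the bool variant could look like (userspace model with
a made-up struct model_layer; the real function also has to deal with
locking and reference counts, which are left out here):

```c
#include <stdbool.h>
#include <stddef.h>

/* One layer of a union stack; next points at the layer below, or NULL. */
struct model_layer {
	struct model_layer *next;
};

/*
 * Step *layer one level down the stack.
 * Returns true if *layer was advanced, false if it was already the
 * bottom layer (in which case *layer is left unchanged).
 */
static bool model_follow_down(struct model_layer **layer)
{
	if (!(*layer)->next)
		return false;
	*layer = (*layer)->next;
	return true;
}
```

Callers then read naturally, e.g. "while (model_follow_down(&layer))", and
nobody has to wonder whether a negative return value is possible.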

> + */
> +int follow_union_down(struct vfsmount **mnt, struct dentry **dentry)
> +{
> + struct union_mount *um;
> +
> + if (!IS_MNT_UNION(*mnt))
> + return 0;
> +
> + spin_lock(&union_lock);
> + um = union_lookup(*dentry, *mnt);
> + spin_unlock(&union_lock);
> + if (um) {
> + path_get(&um->u_next);
> + dput(*dentry);
> + *dentry = um->u_next.dentry;
> + mntput(*mnt);
> + *mnt = um->u_next.mnt;
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> + * follow_union_mount - follow the union stack to the topmost layer
> + *
> + * This is called to traverse the union stack to the topmost layer. This is
> + * necessary for following parent pointers in an union mount.
> + *
> + * Returns none zero if followed to the topmost layer, zero otherwise.

s/none zero/non-zero/

Either way, this function returns 0/1, and can be made boolean until such
day that it has to return something other than a 0 or 1.

> + */
> +int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
> +{
> + struct union_mount *um;
> + int res = 0;
> +
> + while (IS_UNION(*dentry)) {
> + spin_lock(&dcache_lock);
> + spin_lock(&union_lock);
> + um = union_rlookup(*dentry, *mnt);
> + if (um)
> + path_get(&um->u_this);
> + spin_unlock(&union_lock);
> + spin_unlock(&dcache_lock);
> +
> + /*
> + * Q: Aaargh, how do I validate the topmost dentry pointer?
> + * A: Eeeeasy! We took the dcache_lock and union_lock. Since
> + * this protects from any dput'ng going on, we know that the
> + * dentry is valid since the union is unhashed under
> + * dcache_lock too.
> + */
> + if (!um)
> + break;
> + dput(*dentry);
> + *dentry = um->u_this.dentry;
> + mntput(*mnt);
> + *mnt = um->u_this.mnt;
> + res = 1;
> + }
> +
> + return res;
> +}
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 7648b49..4d48c20 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -101,6 +101,15 @@ struct dentry {
> struct dentry *d_parent; /* parent directory */
> struct qstr d_name;
>
> +#ifdef CONFIG_UNION_MOUNT
> + /*
> + * The following fields are used by the VFS based union mount
> + * implementation. Both are protected by union_lock!
> + */
> + struct list_head d_unions; /* list of union_mount's */
> + unsigned int d_unionized; /* unions referencing this dentry */

So what exactly is d_unionized? A reference counter? An integer index
into something? Or just a flag to say whether this dentry is in a union or
not? Whatever the meaning of this field is, I don't think the comment next
to it properly explains what it does.

In general, I'd like to see some more detail explaining the use of these two
fields in your modified struct dentry. Header files are a great place to
demystify the meaning and use of new fields/structures, without forcing
everyone to read the *.c files to understand how things work. Besides, if
you're going to modify something as critical as STRUCT DENTRY, then I think
that some more explanation and justification is due.
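
As one possible shape for such comments, here is a userspace sketch. It
assumes that d_unionized counts the union_mount structures whose u_next points
at the dentry; that reading is inferred from the d_unionized++ and
d_unionized-- sites in patch 21 and may not be what the authors intend. The
struct dentry_sketch type, the d_nr_unions field (standing in for the length
of the d_unions list), and the sketch_* helpers are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down dentry carrying only the union fields. */
struct dentry_sketch {
	/*
	 * d_nr_unions: number of union_mount structures in which this
	 * dentry is the upper ("this") side.  Stands in for the length
	 * of the d_unions list.
	 */
	unsigned int d_nr_unions;
	/*
	 * d_unionized: number of union_mount structures in which this
	 * dentry is the lower ("next") side.  Assumed to be a counter
	 * rather than a flag, so that more than one union can reference
	 * the same lower dentry.
	 */
	unsigned int d_unionized;
};

/* Record that @upper now overlays @lower. */
static void sketch_link(struct dentry_sketch *upper,
			struct dentry_sketch *lower)
{
	upper->d_nr_unions++;
	lower->d_unionized++;
}

/* Undo one sketch_link() call. */
static void sketch_unlink(struct dentry_sketch *upper,
			  struct dentry_sketch *lower)
{
	assert(upper->d_nr_unions > 0 && lower->d_unionized > 0);
	upper->d_nr_unions--;
	lower->d_unionized--;
}

/* Mirrors the IS_UNION() test: the dentry takes part in some union. */
static bool sketch_is_union(const struct dentry_sketch *d)
{
	return d->d_nr_unions > 0 || d->d_unionized > 0;
}
```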

> +#endif
> +
> struct list_head d_lru; /* LRU list */
> /*
> * d_child and d_rcu can share memory
> diff --git a/include/linux/union.h b/include/linux/union.h
> new file mode 100644
> index 0000000..0c85312
> --- /dev/null
> +++ b/include/linux/union.h
> @@ -0,0 +1,61 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
> + * Copyright (C) 2007 Novell Inc.
> + * Author(s): Jan Blunck ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +#ifndef __LINUX_UNION_H
> +#define __LINUX_UNION_H
> +#ifdef __KERNEL__
> +
> +#include <linux/list.h>
> +#include <asm/atomic.h>
> +
> +struct dentry;
> +struct vfsmount;
> +
> +#ifdef CONFIG_UNION_MOUNT
> +
> +/*
> + * The new union mount structure.
> + */
> +struct union_mount {
> + atomic_t u_count; /* reference count */
> + struct mutex u_mutex;
> + struct list_head u_unions; /* list head for d_unions */
> + struct hlist_node u_hash; /* list head for searching */
> + struct hlist_node u_rhash; /* list head for reverse searching */
> +
> + struct path u_this; /* this is me */
> + struct path u_next; /* this is what I overlay */
> +};

I have two major complaints about struct union_mount:

1. In your documentation you refer to the "upper" and "lower" layers. In
this structure, and all over this patch, you're referring to them as
"this" and "next", respectively. I don't care which pair of terms you
use, but please pick one pair of terms and use them consistently
throughout your ENTIRE code base and documentation. Personally, after
trying all sorts of terms myself over the years, I found that "upper" and
"lower" made the most sense, better than "this" and "next". To me, every
function which takes a parameter, that parameter *is* the "this" of that
function. For example, dput() takes a dentry, and that dentry could be
easily defined as "struct dentry *this" to mean that dput() operates on
that specific dentry. Now, I realize that someone could confuse "upper"
to mean "the layer above this one", so I'll accept if you decide to go
with this/next instead; but whatever you do, please be consistent.

2. The field prefixes of u_XXX are confusing at first glance. When I see
u_XXX in C code, my immediate reaction is "oh, this must be a typedef to
some unsigned type like u_int". I strongly suggest you rename all field
prefixes to um_XXX, as it is more traditionally done (taking the
"initials" of each word in the struct name).

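To make both suggestions concrete, here is a small userspace sketch of the
structure with um_ prefixes and upper/lower names. The struct path_sketch and
struct union_mount_sketch types and the um_init() helper are hypothetical; the
list and hash members of the real structure are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct path. */
struct path_sketch {
	const void *mnt;
	const void *dentry;
};

/* The same two paths as u_this/u_next, renamed as suggested above. */
struct union_mount_sketch {
	int um_count;			/* reference count */
	struct path_sketch um_upper;	/* was u_this: the overlaying layer */
	struct path_sketch um_lower;	/* was u_next: the layer being overlaid */
};

/* Initialize @um with one reference and the given pair of layers. */
static void um_init(struct union_mount_sketch *um,
		    struct path_sketch upper, struct path_sketch lower)
{
	assert(um != NULL);
	um->um_count = 1;
	um->um_upper = upper;
	um->um_lower = lower;
}
```
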
> +
> +#define IS_UNION(dentry) (!list_empty(&(dentry)->d_unions) || \
> + (dentry)->d_unionized)
> +#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
> +
> +extern int is_unionized(struct dentry *, struct vfsmount *);
> +extern int append_to_union(struct vfsmount *, struct dentry *,
> + struct vfsmount *, struct dentry *);
> +extern int follow_union_down(struct vfsmount **, struct dentry **);
> +extern int follow_union_mount(struct vfsmount **, struct dentry **);
> +
> +#else /* CONFIG_UNION_MOUNT */
> +
> +#define IS_UNION(x) (0)
> +#define IS_MNT_UNION(x) (0)
> +#define is_unionized(x, y) (0)
> +#define append_to_union(x1, y1, x2, y2) ({ BUG(); (0); })
> +#define follow_union_down(x, y) ({ (0); })
> +#define follow_union_mount(x, y) ({ (0); })
> +
> +#endif /* CONFIG_UNION_MOUNT */
> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_UNION_H */
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Erez.

2009-11-30 08:58:16

by Erez Zadok

Subject: Re: [PATCH 21/41] union-mount: Drive the union cache via dcache

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> If a dentry is removed from dentry cache because its usage count drops to
> zero, the references to the underlying layer of the unions the dentry is in
> are droped too. Therefore the union cache is driven by the dentry cache.

Hmm, in my review for patch 20, I suggested a way to simplify is_unionized()
by marking relevant dentries with a flag whether they are in a union or
not. If you're driving the entire union cache from the dcache, can't this
be done easily then?
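
A rough userspace sketch of the flag idea follows. The DSK_UNIONIZED bit,
struct flag_dentry, and the dsk_* helpers are hypothetical and do not
correspond to any existing DCACHE_* flag or kernel function; locking is
omitted. Whether a per-dentry flag can fully replace the walk up the mount
tree in is_unionized() is not something this sketch settles.

```c
#include <assert.h>
#include <stdbool.h>

#define DSK_UNIONIZED 0x1u	/* hypothetical flag bit, not a real DCACHE_* flag */

struct flag_dentry {
	unsigned int d_flags;
	unsigned int d_unionized;	/* unions referencing this dentry */
};

/* Would be called where the patch does d_unionized++. */
static void dsk_get_union(struct flag_dentry *d)
{
	d->d_unionized++;
	d->d_flags |= DSK_UNIONIZED;
}

/* Would be called where the patch does d_unionized--; clears the flag at zero. */
static void dsk_put_union(struct flag_dentry *d)
{
	assert(d->d_unionized > 0);
	if (--d->d_unionized == 0)
		d->d_flags &= ~DSK_UNIONIZED;
}

/* The cheap test: a single flag check on the dentry itself. */
static bool dsk_is_unionized(const struct flag_dentry *d)
{
	return (d->d_flags & DSK_UNIONIZED) != 0;
}
```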

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 10 ++++++-
> fs/union.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dcache.h | 8 +++++
> include/linux/union.h | 6 ++++
> 4 files changed, 97 insertions(+), 1 deletions(-)

> diff --git a/fs/union.c b/fs/union.c
> index d1950c2..6b99393 100644
> --- a/fs/union.c
> +++ b/fs/union.c
> @@ -14,6 +14,7 @@
>
> #include <linux/bootmem.h>
> #include <linux/init.h>
> +#include <linux/module.h>
> #include <linux/types.h>
> #include <linux/hash.h>
> #include <linux/fs.h>
> @@ -255,6 +256,8 @@ int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
> union_put(this);
> return 0;
> }
> + list_add(&this->u_unions, &dentry->d_unions);
> + dest_dentry->d_unionized++;
> __union_hash(this);
> spin_unlock(&union_lock);
> return 0;
> @@ -330,3 +333,74 @@ int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
>
> return res;
> }
> +
> +/*
> + * This must be called when unhashing a dentry. This is called with dcache_lock
> + * and unhashes all unions this dentry is in.
> + */
> +void __d_drop_unions(struct dentry *dentry)
> +{
> + struct union_mount *this, *next;
> +
> + spin_lock(&union_lock);
> + list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions)
> + __union_unhash(this);
> + spin_unlock(&union_lock);
> +}
> +EXPORT_SYMBOL_GPL(__d_drop_unions);

I thought the convention was that functions prefixed with __ are internal,
not to be extern'ed and exported.

Besides, why export this symbol? Which modules need it in your patchset?

> +/*
> + * This must be called after __d_drop_unions() without holding any locks.
> + * Note: The dentry might still be reachable via a lookup but at that time it
> + * already a negative dentry. Otherwise it would be unhashed. The union_mount
> + * structure itself is still reachable through mnt->mnt_unions (which we
> + * protect against with union_lock).
> + */
> +void shrink_d_unions(struct dentry *dentry)
> +{
> + struct union_mount *this, *next;

See my comments about this/next vs. upper/lower in my review for patch #20.

> +repeat:
> + spin_lock(&union_lock);
> + list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
> + BUG_ON(!hlist_unhashed(&this->u_hash));
> + BUG_ON(!hlist_unhashed(&this->u_rhash));
> + list_del(&this->u_unions);
> + this->u_next.dentry->d_unionized--;
> + spin_unlock(&union_lock);
> + union_put(this);
> + goto repeat;
> + }
> + spin_unlock(&union_lock);
> +}
> +
> +extern void __dput(struct dentry *, struct list_head *, int);

Why this extern here? Isn't there some header file you can #include more
cleanly at the top of this .c file?

> static inline void d_drop(struct dentry *dentry)
> diff --git a/include/linux/union.h b/include/linux/union.h
> index 0c85312..b035a82 100644
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -46,6 +46,9 @@ extern int append_to_union(struct vfsmount *, struct dentry *,
> struct vfsmount *, struct dentry *);
> extern int follow_union_down(struct vfsmount **, struct dentry **);
> extern int follow_union_mount(struct vfsmount **, struct dentry **);
> +extern void __d_drop_unions(struct dentry *);
> +extern void shrink_d_unions(struct dentry *);
> +extern void __shrink_d_unions(struct dentry *, struct list_head *);

Again, I don't understand why two of the three functions above are
prefixed with __ while one of them isn't. I always prefer to name things
for what they actually do, not rely on magic prefixes and conventions to
guess what it means. I suggest trying to find better names to avoid having
so many FOO and __FOO names in this entire patch series.

Erez.

2009-11-30 09:05:24

by Erez Zadok

Subject: Re: [PATCH 22/41] union-mount: Some checks during namespace changes

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Add some additional checks when mounting something into an union.

The text "add some additional checks" seems to imply "code I forgot to add
earlier". If this patch is better merged with some other one, then do so;
otherwise, the subject line and patch header need to give this patch more
credit than this: explain more precisely what the checks are.

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namespace.c | 34 ++++++++++++++++++++++++++++++++++
> 1 files changed, 34 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 81b3188..dc01385 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -29,6 +29,7 @@
> #include <linux/log2.h>
> #include <linux/idr.h>
> #include <linux/fs_struct.h>
> +#include <linux/union.h>
> #include <asm/uaccess.h>
> #include <asm/unistd.h>
> #include "pnode.h"
> @@ -1427,6 +1428,10 @@ static int do_change_type(struct path *path, int flag)
> if (path->dentry != path->mnt->mnt_root)
> return -EINVAL;
>
> + /* Don't change the type of union mounts */
> + if (IS_MNT_UNION(path->mnt))
> + return -EINVAL;
> +
> down_write(&namespace_sem);
> if (type == MS_SHARED) {
> err = invent_group_ids(mnt, recurse);
> @@ -1478,6 +1483,18 @@ static int do_loopback(struct path *path, char *old_name, int recurse,
> if (!mnt)
> goto out;
>
> + /*
> + * Unions couldn't be writable if the filesystem doesn't know about
> + * whiteouts
> + */
> + err = -ENOTSUPP;
> + if ((mnt_flags & MNT_UNION) &&
> + !(mnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
> + goto out;
> +
> + if (mnt_flags & MNT_UNION)
> + mnt->mnt_flags |= MNT_UNION;
> +
> err = graft_tree(mnt, path);
> if (err) {
> LIST_HEAD(umount_list);
> @@ -1571,6 +1588,13 @@ static int do_move_mount(struct path *path, char *old_name)
> if (err)
> return err;
>
> + /* moving to or from a union mount is not supported */
> + err = -EINVAL;
> + if (IS_MNT_UNION(path->mnt))
> + goto exit;
> + if (IS_MNT_UNION(old_path.mnt))
> + goto exit;
> +
> down_write(&namespace_sem);
> while (d_mountpoint(path->dentry) &&
> follow_down(path))
> @@ -1628,6 +1652,7 @@ out:
> up_write(&namespace_sem);
> if (!err)
> path_put(&parent_path);
> +exit:
> path_put(&old_path);
> return err;
> }

I'd avoid using 'exit' as a label and goto name; use "out_err" or some other
out_XXX label. 'exit' is almost a reserved word in C. :-)
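
A small userspace sketch of the label naming I have in mind. The
move_sketch() function and its reference counter are hypothetical stand-ins
for do_move_mount() and the old_path reference; only the control flow around
the early union check is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the early exit: @old_refs stands in for the
 * reference held on old_path, which must be dropped on every path out.
 */
static int move_sketch(int *old_refs, int src_is_union, int dst_is_union)
{
	int err;

	assert(old_refs != NULL);
	(*old_refs)++;			/* stands in for looking up old_path */

	/* moving to or from a union mount is not supported */
	err = -EINVAL;
	if (src_is_union || dst_is_union)
		goto out_put_old;	/* was "goto exit" */

	err = 0;			/* the real move would happen here */

out_put_old:
	(*old_refs)--;			/* stands in for path_put(&old_path) */
	return err;
}
```

The label says what the cleanup does, so a reader does not have to scroll to
the bottom of the function to find out.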

> @@ -1685,6 +1710,15 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
> if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
> goto unlock;
>
> + /*
> + * Unions couldn't be writable if the filesystem doesn't know about
> + * whiteouts
> + */
> + err = -ENOTSUPP;
> + if ((mnt_flags & MNT_UNION) &&
> + !(newmnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
> + goto unlock;
> +
> newmnt->mnt_flags = mnt_flags;
> if ((err = graft_tree(newmnt, path)))
> goto unlock;
> --
> 1.6.3.3

Erez.

2009-11-30 09:16:12

by Erez Zadok

Subject: Re: [PATCH 23/41] union-mount: Changes to the namespace handling

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Creates the proper struct union_mount when mounting something into a
> union. If the topmost filesystem isn't capable of handling the white-out
> filetype it could only be mount read-only.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namespace.c | 7 ++++++
> fs/union.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/mount.h | 3 ++
> include/linux/union.h | 10 +++++++-
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c

I'm curious if this fs/namespace.c patch could be merged with another
fs/namespace.c patch.

> diff --git a/include/linux/union.h b/include/linux/union.h
> index b035a82..0b6f356 100644
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -30,8 +30,9 @@ struct union_mount {
> atomic_t u_count; /* reference count */
> struct mutex u_mutex;
> struct list_head u_unions; /* list head for d_unions */
> - struct hlist_node u_hash; /* list head for searching */
> - struct hlist_node u_rhash; /* list head for reverse searching */
> + struct list_head u_list; /* list head for mnt_unions */
> + struct hlist_node u_hash; /* list head for seaching */

s/seaching/searching/

M-x ispell-comments-and-strings to the rescue.

> + struct hlist_node u_rhash; /* list head for reverse seaching */

A previous patch introduced struct union_mount; this patch modifies it. Why
not introduce the final struct union_mount only once in this patchset? It's
somewhat frustrating to have to read a patch as critical as the one defining
major new data structures, and try to understand it, only to find out that
the data structure is about to undergo major surgery later on.

Erez.

2009-12-01 04:11:10

by Erez Zadok

Subject: Re: [PATCH 24/41] union-mount: Make lookup work for union-mounted file systems

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> On union-mounted file systems the lookup function must also visit lower layers
> of the union-stack when doing a lookup. This patches add support for
> union-mounts to cached lookups and real lookups.
>
> We have 3 different styles of lookup functions now:
> - multiple pathname components, follow mounts, follow union, follow symlinks
> - single pathname component, doesn't follow mounts, follow union, doesn't
> follow symlinks
> - single pathname component doesn't follow mounts, doesn't follow unions,
> doesn't follow symlinks
>
> XXX - Needs to be re-organized to reduce code duplication. But how?
>
> - Create shared lookup_topmost() and build_union() functions that take
> flags or function pointers for real_lookup(), cache_lookup(), etc.
> - Push union code farther down into cache_lookup(), etc.
> - (your idea here)
>
> XXX - Symlinks to other file systems (and probably submounts) don't
> work - see comment in do_lookup().
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 483 ++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/namei.h | 6 +
> 2 files changed, 481 insertions(+), 8 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 408380d..b279686 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -33,6 +33,7 @@
> #include <linux/fcntl.h>
> #include <linux/device_cgroup.h>
> #include <linux/fs_struct.h>
> +#include <linux/union.h>
> #include <asm/uaccess.h>
>
> #define ACC_MODE(x) ("\000\004\002\006"[(x)&O_ACCMODE])
> @@ -415,6 +416,173 @@ static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
> return dentry;
> }
>
> +/**
> + * __cache_lookup_topmost - lookup the topmost (non-)negative dentry
> + *
> + * @nd - parent's nameidata
> + * @name - pathname part to lookup
> + * @path - found dentry for pathname part
> + *
> + * This is used for union mount lookups from dcache. The first non-negative
> + * dentry is searched on all layers of the union stack. Otherwise the topmost
> + * negative dentry is returned.
> + */
> +static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct dentry *dentry;
> +
> + dentry = d_lookup(nd->path.dentry, name);
> + if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
> + dentry = do_revalidate(dentry, nd);
> +
> + /*
> + * Remember the topmost negative dentry in case we don't find anything
> + */
> + path->dentry = dentry;
> + path->mnt = dentry ? nd->path.mnt : NULL;
> +
> + if (!dentry || dentry->d_inode)
> + return !dentry;

While it's a clever trick to return "!dentry" several times in this
function, it's less obvious to the reader what the intention is. Perhaps
document it above the function?

It's also unclear what the side effects of this function are. What does it
do on success or failure: does it return a struct path/dentry/mnt that are
valid? If so, were their refcounts incremented? I'd like to see this
documented.
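
Something along these lines, shown here as a userspace sketch. The
struct cached type and cache_topmost_sketch() are hypothetical, and reference
counting is deliberately not modelled; the point is only the documented
contract and the explicit 0/1 return values in place of "return !dentry".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cached entry: positive means it has an inode. */
struct cached {
	int positive;
};

/**
 * cache_topmost_sketch - find the topmost usable cached entry for a name
 * @layers: per-layer cache results, top first; NULL means "not cached"
 * @n: number of layers
 * @out: set to the first positive entry, else to the topmost negative
 *       entry, else to NULL
 *
 * Returns 0 if @out is usable, 1 if some layer was not cached and the
 * caller must fall back to a real lookup (in which case @out is NULL).
 */
static int cache_topmost_sketch(struct cached **layers, size_t n,
				struct cached **out)
{
	size_t i;

	assert(out != NULL);
	*out = NULL;
	for (i = 0; i < n; i++) {
		if (!layers[i]) {
			*out = NULL;
			return 1;	/* not cached: need a real lookup */
		}
		if (layers[i]->positive) {
			*out = layers[i];
			return 0;	/* first positive entry wins */
		}
		if (i == 0)
			*out = layers[0];	/* remember topmost negative */
	}
	return *out ? 0 : 1;
}
```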

> + /* look for the first non-negative dentry */
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + dentry = d_hash_and_lookup(nd->path.dentry, name);
> +
> + /*
> + * If parts of the union stack are not in the dcache we need
> + * to do a real lookup
> + */
> + if (!dentry)
> + goto out_dput;
> +
> + /*
> + * If parts of the union don't survive the revalidation we
> + * need to do a real lookup
> + */
> + if (dentry->d_op && dentry->d_op->d_revalidate) {
> + dentry = do_revalidate(dentry, nd);
> + if (!dentry)
> + goto out_dput;
> + }
> +
> + if (dentry->d_inode)
> + goto out_dput;
> +
> + dput(dentry);
> + }
> +
> + return !dentry;
> +
> +out_dput:
> + dput(path->dentry);
> + path->dentry = dentry;
> + path->mnt = dentry ? mntget(nd->path.mnt) : NULL;
> + return !dentry;
> +}
> +
> +/**
> + * __cache_lookup_build_union - build the union stack for this part,
> + * cached version
> + *
> + * This is called after you have the topmost dentry in @path.
> + */
> +static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path last = *path;
> + struct dentry *dentry;
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + dentry = d_hash_and_lookup(nd->path.dentry, name);
> + if (!dentry)
> + return 1;

Another function that can be made to return a boolean.

> + if (dentry->d_op && dentry->d_op->d_revalidate) {
> + dentry = do_revalidate(dentry, nd);
> + if (!dentry)
> + return 1;
> + }
> +
> + if (!dentry->d_inode) {
> + dput(dentry);
> + continue;
> + }
> +
> + /* only directories can be part of a union stack */
> + if (!S_ISDIR(dentry->d_inode->i_mode)) {
> + dput(dentry);
> + break;
> + }
> +
> + /* Add the newly discovered dir to the union stack */
> + append_to_union(last.mnt, last.dentry, nd->path.mnt, dentry);

BTW, the name 'append_to_union', specifically the 'append' part, implies
that a union has a start and an end -- and that you append to the END. But
where is the end here: the bottom or the top of the union? I think the
term append may be confusing: perhaps a better term would be
"push_onto_union" (and the converse, "pop_top_of_union").

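For reference, this is what push/pop vocabulary would conventionally mean, as
a userspace sketch with hypothetical struct ulayer and struct ustack types and
_sketch helpers. Note that in the code above append_to_union() is called while
walking downward, so it attaches a directory below `last` rather than on top
of the stack; the sketch only illustrates the naming, not that behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer and stack types. */
struct ulayer {
	const char *name;
	struct ulayer *lower;	/* layer directly beneath this one */
};

struct ustack {
	struct ulayer *top;	/* topmost layer, NULL if empty */
};

/* Make @l the new topmost layer of @s. */
static void push_onto_union_sketch(struct ustack *s, struct ulayer *l)
{
	assert(s && l);
	l->lower = s->top;
	s->top = l;
}

/* Detach and return the topmost layer of @s, or NULL if @s is empty. */
static struct ulayer *pop_top_of_union_sketch(struct ustack *s)
{
	struct ulayer *l;

	assert(s != NULL);
	l = s->top;
	if (l) {
		s->top = l->lower;
		l->lower = NULL;
	}
	return l;
}
```
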
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> + last.dentry = dentry;
> + last.mnt = mntget(nd->path.mnt);
> + }
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> +
> + return 0;
> +}
> +
> +/**
> + * cache_lookup_union - lookup a single pathname part from dcache
> + *
> + * This is a union mount capable version of what d_lookup() & revalidate()
> + * would do. This function returns a valid (union) dentry on success.
> + *
> + * Remember: On failure it means that parts of the union aren't cached. You
> + * should call real_lookup() afterwards to find the proper (union) dentry.
> + */
> +static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + int res ;
> +
> + if (!IS_MNT_UNION(nd->path.mnt)) {
> + path->dentry = cache_lookup(nd->path.dentry, name, nd);
> + path->mnt = path->dentry ? nd->path.mnt : NULL;
> + res = path->dentry ? 0 : 1;
> + } else {
> + struct path safe = {
> + .dentry = nd->path.dentry,
> + .mnt = nd->path.mnt
> + };

There's something unclean about having to save the nd->path and later on
restore it. Is this due to a limitation of __cache_lookup_build_union
below, or something else? You also (below) compare the saved mnt version
against what you just got: was this the reason for saving the nd->path?

> + path_get(&safe);
> + res = __cache_lookup_topmost(nd, name, path);
> + if (res)
> + goto out;
> +
> + /* only directories can be part of a union stack */
> + if (!path->dentry->d_inode ||
> + !S_ISDIR(path->dentry->d_inode->i_mode))
> + goto out;
> +
> + /* Build the union stack for this part */
> + res = __cache_lookup_build_union(nd, name, path);
> + if (res) {
> + dput(path->dentry);
> + if (path->mnt != safe.mnt)
> + mntput(path->mnt);
> + goto out;
> + }
> +
> +out:
> + path_put(&nd->path);
> + nd->path.dentry = safe.dentry;
> + nd->path.mnt = safe.mnt;
> + }
> +
> + return res;
> +}
> +
> /*
> * Short-cut version of permission(), for calling by
> * path_walk(), when dcache lock is held. Combines parts
> @@ -536,6 +704,146 @@ out_unlock:
> return res;
> }
>
> +/**
> + * __real_lookup_topmost - lookup topmost dentry, non-cached version
> + *
> + * If we reach a dentry with restricted access, we just stop the lookup
> + * because we shouldn't see through that dentry. Same thing for dentry
> + * type mismatch and whiteouts.
> + *
> + * FIXME:
> + * - handle DT_WHT

If this function doesn't yet handle DT_WHT, isn't it sort of a fundamental
functionality that's missing (handling whiteouts)?!

> + * - handle union stacks in use
> + * - handle union stacks mounted upon union stacks
> + * - avoid unnecessary allocations of union locks
> + */
> +static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path next;
> + int err;
> +
> + err = real_lookup(nd, name, path);
> + if (err)
> + return err;
> +
> + if (path->dentry->d_inode)
> + return 0;
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + name->hash = full_name_hash(name->name, name->len);
> + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> + err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
> + name);
> + if (err < 0)
> + goto out;
> + }
> +
> + err = real_lookup(nd, name, &next);
> + if (err)
> + goto out;
> +
> + if (next.dentry->d_inode) {
> + dput(path->dentry);
> + mntget(next.mnt);
> + *path = next;
> + goto out;
> + }
> +
> + dput(next.dentry);
> + }
> +out:
> + if (err)
> + dput(path->dentry);
> + return err;
> +}
> +
> +/**
> + * __real_lookup_build_union: build the union stack for this pathname
> + * part, non-cached version
> + *
> + * Called when not all parts of the union stack are in cache
> + */
> +
> +static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path last = *path;
> + struct path next;
> + int err = 0;
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + /* We need to recompute the hash for lower layer lookups */
> + name->hash = full_name_hash(name->name, name->len);
> + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> + err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
> + name);
> + if (err < 0)
> + goto out;
> + }
> +
> + err = real_lookup(nd, name, &next);
> + if (err)
> + goto out;
> +
> + if (!next.dentry->d_inode) {
> + dput(next.dentry);
> + continue;
> + }
> +
> + /* only directories can be part of a union stack */
> + if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
> + dput(next.dentry);
> + break;
> + }
> +
> + /* now we know we found something "real" */
> + append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> + last.dentry = next.dentry;
> + last.mnt = mntget(next.mnt);
> + }
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> +out:
> + return err;
> +}
> +
> +static int real_lookup_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
> + int res ;
> +
> + path_get(&safe);
> + res = __real_lookup_topmost(nd, name, path);
> + if (res)
> + goto out;
> +
> + /* only directories can be part of a union stack */
> + if (!path->dentry->d_inode ||
> + !S_ISDIR(path->dentry->d_inode->i_mode))
> + goto out;
> +
> + /* Build the union stack for this part */
> + res = __real_lookup_build_union(nd, name, path);
> + if (res) {
> + dput(path->dentry);
> + if (path->mnt != safe.mnt)
> + mntput(path->mnt);
> + goto out;
> + }
> +
> +out:
> + path_put(&nd->path);
> + nd->path.dentry = safe.dentry;
> + nd->path.mnt = safe.mnt;
> + return res;
> +}
> +
> /*
> * Wrapper to retry pathname resolution whenever the underlying
> * file system returns an ESTALE.
> @@ -790,6 +1098,7 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
> nd->path.mnt = parent;
> }
> follow_mount(&nd->path);
> + follow_union_mount(&nd->path.mnt, &nd->path.dentry);
> }
>
> /*
> @@ -802,6 +1111,9 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> {
> int err;
>
> + if (IS_MNT_UNION(nd->path.mnt))
> + goto need_union_lookup;
> +
> path->dentry = __d_lookup(nd->path.dentry, name);
> path->mnt = nd->path.mnt;
> if (!path->dentry)
> @@ -810,7 +1122,25 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> goto need_revalidate;
>
> done:
> - __follow_mount(path);
> + if (nd->path.mnt != path->mnt) {
> + /*
> + * XXX FIXME: We only want to set this flag if we
> + * crossed from the top layer to the bottom layer of a
> + * union mount. But nd->path.mnt != path->mnt is also
> + * true when we cross from the top layer of a union
> + * mount to another file system, either by symlink or
> + * file system mounted on a directory in the union
> + * mount (probably - haven't tested).
> + *
> + * This might be an issue for every mnt/mnt comparison
> + * - or maybe just during the brief window between
> + * do_lookup() and do_follow_link() or follow_mount().
> + */
> + nd->um_flags |= LAST_LOWLEVEL;

It's unclear to me why you need an extra flag to mark the bottom of the
union. And, if you needed such a marker, then why couldn't it have been set
upon the very first union-mount on top of a non-unioned f/s? (or use
is_unionized instead?)

> + follow_mount(path);
> + } else
> + __follow_mount(path);
> + follow_union_mount(&path->mnt, &path->dentry);
> return 0;
>
> need_lookup:
> @@ -819,6 +1149,16 @@ need_lookup:
> goto fail;
> goto done;
>
> +need_union_lookup:
> + err = cache_lookup_union(nd, name, path);
> + if (!err && path->dentry)
> + goto done;
> +
> + err = real_lookup_union(nd, name, path);
> + if (err)
> + goto fail;
> + goto done;
> +
> need_revalidate:
> path->dentry = do_revalidate(path->dentry, nd);
> if (!path->dentry)
> @@ -857,6 +1197,8 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
> if (nd->depth)
> lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
>
> + follow_union_mount(&nd->path.mnt, &nd->path.dentry);
> +
> /* At this point we know we have a real path component. */
> for(;;) {
> unsigned long hash;
> @@ -1041,6 +1383,7 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei
>
> nd->last_type = LAST_ROOT; /* if there are only slashes... */
> nd->flags = flags;
> + nd->um_flags = 0;
> nd->depth = 0;
> nd->root.mnt = NULL;
>
> @@ -1249,6 +1592,130 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
> return err;
> }
>
> +static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path next;
> + int err;
> +
> + err = lookup_hash(nd, name, path);
> + if (err)
> + return err;
> +
> + if (path->dentry->d_inode)
> + return 0;
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + name->hash = full_name_hash(name->name, name->len);
> + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> + err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
> + name);
> + if (err < 0)
> + goto out;
> + }
> +
> + mutex_lock(&nd->path.dentry->d_inode->i_mutex);
> + err = lookup_hash(nd, name, &next);
> + mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
> + if (err)
> + goto out;
> +
> + if (next.dentry->d_inode) {
> + dput(path->dentry);
> + mntget(next.mnt);
> + *path = next;
> + goto out;
> + }
> +
> + dput(next.dentry);
> + }
> +out:
> + if (err)
> + dput(path->dentry);
> + return err;
> +}
> +
> +static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path last = *path;
> + struct path next;
> + int err = 0;
> +
> + while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> + /* We need to recompute the hash for lower layer lookups */
> + name->hash = full_name_hash(name->name, name->len);
> + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> + err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
> + name);
> + if (err < 0)
> + goto out;
> + }
> +
> + mutex_lock(&nd->path.dentry->d_inode->i_mutex);
> + err = lookup_hash(nd, name, &next);
> + mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
> + if (err)
> + goto out;
> +
> + if (!next.dentry->d_inode) {
> + dput(next.dentry);
> + continue;
> + }
> +
> + /* only directories can be part of a union stack */
> + if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
> + dput(next.dentry);
> + break;
> + }
> +
> + /* now we know we found something "real" */
> + append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> + last.dentry = next.dentry;
> + last.mnt = mntget(next.mnt);
> + }
> +
> + if (last.dentry != path->dentry)
> + path_put(&last);
> +out:
> + return err;
> +}
> +
> +static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
> + int res ;
> +
> + path_get(&safe);
> + res = __hash_lookup_topmost(nd, name, path);
> + if (res)
> + goto out;
> +
> + /* only directories can be part of a union stack */
> + if (!path->dentry->d_inode ||
> + !S_ISDIR(path->dentry->d_inode->i_mode))
> + goto out;
> +
> + /* Build the union stack for this part */
> + res = __hash_lookup_build_union(nd, name, path);
> + if (res) {
> + dput(path->dentry);
> + if (path->mnt != safe.mnt)
> + mntput(path->mnt);
> + goto out;
> + }
> +
> +out:
> + path_put(&nd->path);
> + nd->path.dentry = safe.dentry;
> + nd->path.mnt = safe.mnt;
> + return res;
> +}
> +
> static int __lookup_one_len(const char *name, struct qstr *this,
> struct dentry *base, int len)
> {
> @@ -1756,7 +2223,7 @@ struct file *do_filp_open(int dfd, const char *pathname,
> if (flag & O_EXCL)
> nd.flags |= LOOKUP_EXCL;
> mutex_lock(&dir->d_inode->i_mutex);
> - error = lookup_hash(&nd, &nd.last, &path);
> + error = hash_lookup_union(&nd, &nd.last, &path);
>
> do_last:
> if (error) {
> @@ -1920,7 +2387,7 @@ do_link:
> }
> dir = nd.path.dentry;
> mutex_lock(&dir->d_inode->i_mutex);
> - error = lookup_hash(&nd, &nd.last, &path);
> + error = hash_lookup_union(&nd, &nd.last, &path);
> __putname(nd.last.name);
> goto do_last;
> }
> @@ -1971,7 +2438,7 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
> /*
> * Do the final lookup.
> */
> - err = lookup_hash(nd, &nd->last, &path);
> + err = hash_lookup_union(nd, &nd->last, &path);
> if (err) {
> path.dentry = ERR_PTR(err);
> goto fail;
> @@ -2467,7 +2934,7 @@ static long do_rmdir(int dfd, const char __user *pathname)
> nd.flags &= ~LOOKUP_PARENT;
>
> mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
> - error = lookup_hash(&nd, &nd.last, &path);
> + error = hash_lookup_union(&nd, &nd.last, &path);
> if (error)
> goto exit2;
> error = mnt_want_write(nd.path.mnt);
> @@ -2550,7 +3017,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
> nd.flags &= ~LOOKUP_PARENT;
>
> mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
> - error = lookup_hash(&nd, &nd.last, &path);
> + error = hash_lookup_union(&nd, &nd.last, &path);
> if (!error) {
> /* Why not before? Because we want correct error value */
> if (nd.last.name[nd.last.len])
> @@ -2954,7 +3421,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
>
> trap = lock_rename(new_dir, old_dir);
>
> - error = lookup_hash(&oldnd, &oldnd.last, &old);
> + error = hash_lookup_union(&oldnd, &oldnd.last, &old);
> if (error)
> goto exit3;
> /* source must exist */
> @@ -2973,7 +3440,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
> error = -EINVAL;
> if (old.dentry == trap)
> goto exit4;
> - error = lookup_hash(&newnd, &newnd.last, &new);
> + error = hash_lookup_union(&newnd, &newnd.last, &new);
> if (error)
> goto exit4;
> /* target should not be an ancestor of source */
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index d870ae2..81afb59 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -20,6 +20,7 @@ struct nameidata {
> struct qstr last;
> struct path root;
> unsigned int flags;
> + unsigned int um_flags;

BTW, do we need a separate um_flags, or is there enough space in 'flags' to
store UM's flags as well?
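
To make the question concrete: a minimal userspace sketch of keeping the
union-mount bits in the upper part of the same word. The LOOKUP_* and LAST_*
values are the ones quoted in this patch; UM_SHIFT and the um_*() helpers are
hypothetical, and the sketch assumes bits 24 and up of 'flags' are unused,
which would need checking against the full LOOKUP_* list.

```c
#include <assert.h>

/* Values quoted from the patch. */
#define LOOKUP_PARENT   16
#define LOOKUP_TOPMOST  128
#define LAST_UNION      0x01
#define LAST_LOWLEVEL   0x02

/* Hypothetical: assume bits 24 and up of 'flags' are free. */
#define UM_SHIFT 24
#define UM_BITS  ((unsigned int)(LAST_UNION | LAST_LOWLEVEL))
#define UM_MASK  (UM_BITS << UM_SHIFT)

/* Set union-mount bits without disturbing the LOOKUP_* bits. */
unsigned int um_set(unsigned int flags, unsigned int um)
{
	assert((um & ~UM_BITS) == 0);
	return flags | (um << UM_SHIFT);
}

/* Clear union-mount bits, leaving the LOOKUP_* bits alone. */
unsigned int um_clear(unsigned int flags, unsigned int um)
{
	assert((um & ~UM_BITS) == 0);
	return flags & ~(um << UM_SHIFT);
}

/* Extract the union-mount bits as plain LAST_* values. */
unsigned int um_get(unsigned int flags)
{
	return (flags & UM_MASK) >> UM_SHIFT;
}
```

With something like this, "nd->um_flags &= ~LAST_LOWLEVEL" would become
"nd->flags = um_clear(nd->flags, LAST_LOWLEVEL)"; whether that is cleaner
than a second field is a judgment call.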

> int last_type;
> unsigned depth;
> char *saved_names[MAX_NESTED_LINKS + 1];
> @@ -35,6 +36,9 @@ struct nameidata {
> */
> enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
>
> +#define LAST_UNION 0x01
> +#define LAST_LOWLEVEL 0x02
> +
> /*
> * The bitmask for a lookup event:
> * - follow links at the end
> @@ -49,6 +53,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
> #define LOOKUP_CONTINUE 4
> #define LOOKUP_PARENT 16
> #define LOOKUP_REVAL 64
> +#define LOOKUP_TOPMOST 128
> +
> /*
> * Intent data
> */
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Erez.

2009-12-01 04:11:24

by Erez Zadok

Subject: Re: [PATCH 25/41] union-mount: stop lookup when directory has S_OPAQUE flag set

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Honor the S_OPAQUE flag in the union path lookup.

Was it intentional to have a separate patch which adds opaque directories
support, or should it be part of the larger patch #24?

>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 17 ++++++++++++++---
> 1 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b279686..8ebbf4f 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -523,6 +523,9 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = dentry;
> last.mnt = mntget(nd->path.mnt);
> +
> + if (IS_OPAQUE(last.dentry->d_inode))
> + break;
> }
>
> if (last.dentry != path->dentry)
> @@ -562,7 +565,8 @@ static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
>
> /* only directories can be part of a union stack */
> if (!path->dentry->d_inode ||
> - !S_ISDIR(path->dentry->d_inode->i_mode))
> + !S_ISDIR(path->dentry->d_inode->i_mode) ||
> + IS_OPAQUE(path->dentry->d_inode))
> goto out;
>
> /* Build the union stack for this part */
> @@ -804,6 +808,9 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = next.dentry;
> last.mnt = mntget(next.mnt);
> +
> + if (IS_OPAQUE(last.dentry->d_inode))
> + break;
> }
>
> if (last.dentry != path->dentry)
> @@ -825,7 +832,8 @@ static int real_lookup_union(struct nameidata *nd, struct qstr *name,
>
> /* only directories can be part of a union stack */
> if (!path->dentry->d_inode ||
> - !S_ISDIR(path->dentry->d_inode->i_mode))
> + !S_ISDIR(path->dentry->d_inode->i_mode) ||
> + IS_OPAQUE(path->dentry->d_inode))
> goto out;
>
> /* Build the union stack for this part */
> @@ -1111,7 +1119,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> {
> int err;
>
> - if (IS_MNT_UNION(nd->path.mnt))
> + if (IS_MNT_UNION(nd->path.mnt) && !IS_OPAQUE(nd->path.dentry->d_inode))
> goto need_union_lookup;
>
> path->dentry = __d_lookup(nd->path.dentry, name);
> @@ -1676,6 +1684,9 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = next.dentry;
> last.mnt = mntget(next.mnt);
> +
> + if (IS_OPAQUE(last.dentry->d_inode))
> + break;
> }
>
> if (last.dentry != path->dentry)
> --
> 1.6.3.3

Erez.

2009-12-01 04:12:12

by Erez Zadok

Subject: Re: [PATCH 26/41] union-mount: stop lookup when finding a whiteout

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> Stop the lookup if we find a whiteout during union path lookup.

Was it intentional to have a separate patch which adds opaque directories
support, or should it be part of the larger patch #24?

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 30 ++++++++++++++++++++++--------
> 1 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 8ebbf4f..fb463ac 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -442,10 +442,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
> path->dentry = dentry;
> path->mnt = dentry ? nd->path.mnt : NULL;
>
> - if (!dentry || dentry->d_inode)
> + if (!dentry || (dentry->d_inode || d_is_whiteout(dentry)))
> return !dentry;

Unnecessary set of () around second and third || clauses above.

>
> - /* look for the first non-negative dentry */
> + /* look for the first non-negative or whiteout dentry */
>
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> dentry = d_hash_and_lookup(nd->path.dentry, name);
> @@ -467,7 +467,7 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
> goto out_dput;
> }
>
> - if (dentry->d_inode)
> + if (dentry->d_inode || d_is_whiteout(dentry))
> goto out_dput;
>
> dput(dentry);
> @@ -505,6 +505,11 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
> return 1;
> }
>
> + if (d_is_whiteout(dentry)) {
> + dput(dentry);
> + break;
> + }
> +
> if (!dentry->d_inode) {
> dput(dentry);
> continue;
> @@ -716,7 +721,6 @@ out_unlock:
> * type mismatch and whiteouts.
> *
> * FIXME:
> - * - handle DT_WHT

Ah, ok: so this patch adds DT_WHT support. Still, I don't see why it can't
just be folded into the already pretty large patch #24; and maybe patch 24
could be split a different way to facilitate easier reviewing?

> * - handle union stacks in use
> * - handle union stacks mounted upon union stacks
> * - avoid unnecessary allocations of union locks
> @@ -731,7 +735,7 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (err)
> return err;
>
> - if (path->dentry->d_inode)
> + if (path->dentry->d_inode || d_is_whiteout(path->dentry))
> return 0;
>
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> @@ -747,7 +751,7 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (err)
> goto out;
>
> - if (next.dentry->d_inode) {
> + if (next.dentry->d_inode || d_is_whiteout(next.dentry)) {
> dput(path->dentry);
> mntget(next.mnt);
> *path = next;
> @@ -790,6 +794,11 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
> if (err)
> goto out;
>
> + if (d_is_whiteout(next.dentry)) {
> + dput(next.dentry);
> + break;
> + }
> +
> if (!next.dentry->d_inode) {
> dput(next.dentry);
> continue;
> @@ -1610,7 +1619,7 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (err)
> return err;
>
> - if (path->dentry->d_inode)
> + if (path->dentry->d_inode || d_is_whiteout(path->dentry))
> return 0;
>
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> @@ -1628,7 +1637,7 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (err)
> goto out;
>
> - if (next.dentry->d_inode) {
> + if (next.dentry->d_inode || d_is_whiteout(next.dentry)) {
> dput(path->dentry);
> mntget(next.mnt);
> *path = next;
> @@ -1666,6 +1675,11 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
> if (err)
> goto out;
>
> + if (d_is_whiteout(next.dentry)) {
> + dput(next.dentry);
> + break;
> + }
> +
> if (!next.dentry->d_inode) {
> dput(next.dentry);
> continue;
> --
> 1.6.3.3

Erez.

2009-12-01 04:14:32

by Erez Zadok

Subject: Re: [PATCH 27/41] union-mount: in-kernel file copy between union mounted filesystems

In message <[email protected]>, Valerie Aurora writes:
> This patch introduces in-kernel file copy between union mounted
> filesystems. When a file is opened for writing but resides on a lower (thus
> read-only) layer of the union stack it is copied to the topmost union layer
> first.

There are many stupid applications out there which open(O_RDWR)/close() w/o
ever writing anything. You'll do a lot of unnecessary copyup this way. In
Unionfs I had to implement lazy copyup upon the first actual ->write to a
file which was opened for writing and was a candidate for copyup at open()
time.
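
A rough userspace model of the lazy scheme, to show the shape of it: open()
only records that a copyup may be needed, and the first write pays for it.
Everything here (struct um_file, um_open(), um_write(), um_demo_copyups()) is
a made-up illustration, not code from Unionfs or from this patch set.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical per-open state; not a structure from the patch. */
struct um_file {
	bool on_lower;		/* file still lives on the read-only layer */
	bool copyup_pending;	/* opened for write, copyup deferred */
	int copyups;		/* copyups performed so far (for the demo) */
};

/* At open time, only note that a copyup may be needed later. */
void um_open(struct um_file *f, int flags)
{
	int acc = flags & O_ACCMODE;

	f->copyup_pending = f->on_lower && (acc == O_WRONLY || acc == O_RDWR);
}

/* The first real write triggers the copyup; later writes do not. */
void um_write(struct um_file *f)
{
	/* Writing to a lower file that was not opened for write is a bug. */
	assert(!f->on_lower || f->copyup_pending);
	if (f->copyup_pending) {
		f->copyups++;		/* stand-in for the actual data copy */
		f->on_lower = false;
		f->copyup_pending = false;
	}
}

/* Open a lower-layer file with 'flags', issue 'nwrites' writes, and
 * report how many copyups happened. */
int um_demo_copyups(int flags, int nwrites)
{
	struct um_file f = { .on_lower = true };

	um_open(&f, flags);
	while (nwrites-- > 0)
		um_write(&f);
	return f.copyups;
}
```

In this model an open(O_RDWR)/close() pair with no write in between costs
nothing, and any number of writes costs exactly one copyup.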

Also, some apps open a file with O_WRONLY|O_TRUNC, b/c they want to overwrite
the file with new data (and they don't want to truncate at file close time).
But, in your case, you'll do all the hard work of copyup only to find you
have to discard all that copied up data. You need an optimization for
open(O_TRUNC) as well.
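
As a sketch of the O_TRUNC case, assuming the decision can be made from the
open flags alone: the helper below (copyup_data_len(), a hypothetical name)
returns how many bytes of the lower file actually need copying. The file
itself would still have to be created on the top layer with the right mode
and ownership; only the data copy is skipped.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* How many bytes of the lower file must be copied for this open()?
 * Hypothetical helper; the patch as posted always copies the whole file. */
off_t copyup_data_len(int flags, off_t lower_size)
{
	assert(lower_size >= 0);
	if (flags & O_TRUNC)
		return 0;	/* contents would be discarded anyway */
	return lower_size;
}
```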

These two optimizations can be put on the future todo list for now.

> This patch uses the do_splice() for doing the in-kernel file copy.
>
> XXX - Optimize for non-union mounts in union mount enabled kernels
> (esp. call to is_unionized() in do_filp_open()).
>
> XXX - "flags" argument to union_copyup() is unused - bug? Leftover
> code?
>
> Signed-off-by: Bharata B Rao <[email protected]>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 64 +++++++++-
> fs/union.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/union.h | 7 +
> 3 files changed, 383 insertions(+), 4 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index fb463ac..f7ef769 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1050,7 +1050,7 @@ static int __follow_mount(struct path *path)
> return res;
> }
>
> -static void follow_mount(struct path *path)
> +void follow_mount(struct path *path)
> {
> while (d_mountpoint(path->dentry)) {
> struct vfsmount *mounted = lookup_mnt(path);
> @@ -1284,6 +1284,21 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
> if (err)
> break;
>
> + if ((nd->flags & LOOKUP_TOPMOST) &&
> + (nd->um_flags & LAST_LOWLEVEL)) {

OK, so finally I see a use of this LAST_LOWLEVEL flag. Still, it's not
clear to me immediately why we need this and how it's supposed to get used.
Perhaps some more comments in the code are needed?

Also, if we start with only a two-level union, the terms "TOPMOST" and
"LAST_LOWLEVEL" are somewhat misleading.

> + struct dentry *dentry;
> +
> + dentry = union_create_topmost(nd, &this, &next);
> + if (IS_ERR(dentry)) {
> + err = PTR_ERR(dentry);
> + goto out_dput;
> + }
> + path_put_conditional(&next, nd);
> + next.mnt = nd->path.mnt;
> + next.dentry = dentry;
> + nd->um_flags &= ~LAST_LOWLEVEL;
> + }
> +
> err = -ENOENT;
> inode = next.dentry->d_inode;
> if (!inode)
> @@ -1333,6 +1348,22 @@ last_component:
> err = do_lookup(nd, &this, &next);
> if (err)
> break;
> +
> + if ((nd->flags & LOOKUP_TOPMOST) &&
> + (nd->um_flags & LAST_LOWLEVEL)) {
> + struct dentry *dentry;
> +
> + dentry = union_create_topmost(nd, &this, &next);
> + if (IS_ERR(dentry)) {
> + err = PTR_ERR(dentry);
> + goto out_dput;
> + }
> + path_put_conditional(&next, nd);
> + next.mnt = nd->path.mnt;
> + next.dentry = dentry;
> + nd->um_flags &= ~LAST_LOWLEVEL;
> + }
> +
> inode = next.dentry->d_inode;
> if ((lookup_flags & LOOKUP_FOLLOW)
> && inode && inode->i_op->follow_link) {
> @@ -1709,7 +1740,7 @@ out:
> return err;
> }
>
> -static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
> +int hash_lookup_union(struct nameidata *nd, struct qstr *name,
> struct path *path)
> {
> struct path safe = { .dentry = nd->path.dentry, .mnt = nd->path.mnt };
> @@ -2208,6 +2239,12 @@ struct file *do_filp_open(int dfd, const char *pathname,
> &nd, flag);
> if (error)
> return ERR_PTR(error);
> + if (unlikely(flag & FMODE_WRITE)) {

Why the unlikely()? Opening a file for writing is neither too likely nor too
unlikely -- so remove this unlikely() wrapper.
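
For reference, unlikely() is only a branch-layout hint to the compiler; it
never changes the result of the test, so dropping it is safe. A small
userspace sketch, assuming GCC or Clang for __builtin_expect(); the
wants_write_*() helpers are made up for the comparison, and FMODE_WRITE is
given the value 0x2 it has in include/linux/fs.h.

```c
#include <assert.h>

/* Same shape as the kernel's hint macros (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define FMODE_WRITE 0x2

/* With the hint: the compiler may move the "true" path out of line. */
int wants_write_hinted(unsigned int flag)
{
	if (unlikely(flag & FMODE_WRITE))
		return 1;
	return 0;
}

/* Without the hint: same logic, layout left to the compiler. */
int wants_write_plain(unsigned int flag)
{
	if (flag & FMODE_WRITE)
		return 1;
	return 0;
}

/* The hint affects code layout only, never the answer. */
void check_same_result(unsigned int flag)
{
	assert(wants_write_hinted(flag) == wants_write_plain(flag));
}
```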

> + /* Check for union, etc. in union_copyup */
> + error = union_copyup(&nd, flag /* XXX not used */);
> + if (error)
> + return ERR_PTR(error);
> + }
> goto ok;
> }
>
> @@ -2311,10 +2348,23 @@ do_last:
> if (path.dentry->d_inode->i_op->follow_link)
> goto do_link;
>
> - path_to_nameidata(&path, &nd);
> error = -EISDIR;
> if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
> - goto exit;
> + goto exit_dput;
> +
> + /*
> + * If this file is on a lower layer of the union stack, copy it to the
> + * topmost layer before opening it
> + */
> + if (path.dentry->d_inode &&
> + (path.dentry->d_parent != dir) &&
> + S_ISREG(path.dentry->d_inode->i_mode)) {
> + error = __union_copyup(&path, &nd, &path);
> + if (error)
> + goto exit_dput;
> + }
> +
> + path_to_nameidata(&path, &nd);
> ok:
> /*
> * Consider:
> @@ -3472,6 +3522,12 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
> error = -ENOTEMPTY;
> if (new.dentry == trap)
> goto exit5;
> + /* renaming on unions is done by the user-space */
> + error = -EXDEV;
> + if (is_unionized(oldnd.path.dentry, oldnd.path.mnt))
> + goto exit5;
> + if (is_unionized(newnd.path.dentry, newnd.path.mnt))
> + goto exit5;

Nit: two 'if' statements can be merged into one.

> error = mnt_want_write(oldnd.path.mnt);
> if (error)
> diff --git a/fs/union.c b/fs/union.c
> index 341fc03..de31fc9 100644
> --- a/fs/union.c
> +++ b/fs/union.c
> @@ -21,6 +21,14 @@
> #include <linux/mount.h>
> #include <linux/fs_struct.h>
> #include <linux/union.h>
> +#include <linux/namei.h>
> +#include <linux/file.h>
> +#include <linux/mm.h>
> +#include <linux/quotaops.h>
> +#include <linux/dnotify.h>
> +#include <linux/security.h>
> +#include <linux/pipe_fs_i.h>
> +#include <linux/splice.h>
>
> /*
> * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
> @@ -337,6 +345,314 @@ int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
> }
>
> /*
> + * Union mount copyup support
> + */
> +
> +extern int hash_lookup_union(struct nameidata *, struct qstr *, struct path *);
> +extern void follow_mount(struct path *path);

Shouldn't these externs come from some header file already?

> +
> +/*
> + * union_relookup_topmost - lookup and create the topmost path to dentry
> + * @nd: pointer to nameidata
> + * @flags: lookup flags
> + */
> +static int union_relookup_topmost(struct nameidata *nd, int flags)
> +{
> + int err;
> + char *kbuf, *name;
> + struct nameidata this;
> +
> + kbuf = (char *)__get_free_page(GFP_KERNEL);
> + if (!kbuf)
> + return -ENOMEM;
> +
> + name = d_path(&nd->path, kbuf, PAGE_SIZE);
> + err = PTR_ERR(name);
> + if (IS_ERR(name))
> + goto free_page;
> +
> + err = path_lookup(name, flags|LOOKUP_CREATE|LOOKUP_TOPMOST, &this);
> + if (err)
> + goto free_page;
> +
> + path_put(&nd->path);
> + nd->path.dentry = this.path.dentry;
> + nd->path.mnt = this.path.mnt;
> +
> + /*
> + * the nd->flags should be unchanged
> + */
> + BUG_ON(this.um_flags & LAST_LOWLEVEL);
> + nd->um_flags &= ~LAST_LOWLEVEL;
> + free_page:
> + free_page((unsigned long)kbuf);
> + return err;
> +}
> +
> +/*
> + * union_create_topmost - create the topmost path component
> + * @nd: pointer to nameidata of the base directory
> + * @name: pointer to file name
> + * @path: pointer to path of the overlaid file
> + *
> + * This is called by __link_path_walk() to create the directories on a path
> + * when it is called with LOOKUP_TOPMOST.
> + */
> +struct dentry *union_create_topmost(struct nameidata *nd, struct qstr *name,
> + struct path *path)
> +{
> + struct dentry *dentry, *parent = nd->path.dentry;
> + int res, mode = path->dentry->d_inode->i_mode;
> +
> + if (parent->d_sb == path->dentry->d_sb)
> + return ERR_PTR(-EEXIST);
> +
> + res = mnt_want_write(nd->path.mnt);
> + if (res)
> + return ERR_PTR(res);
> +
> + mutex_lock(&parent->d_inode->i_mutex);
> + dentry = lookup_one_len(name->name, nd->path.dentry, name->len);

I thought new users of lookup_one_len were discouraged b/c it doesn't follow
vfsmounts (you're calling lookup_one_len again below).

> + if (IS_ERR(dentry))
> + goto out_unlock;
> +
> + switch (mode & S_IFMT) {
> + case S_IFREG:
> + /*
> + * FIXME: Does this make any sense in this case?
> + * Special case - lookup gave negative, but... we had foo/bar/
> + * From the vfs_mknod() POV we just have a negative dentry -

I'm not sure I understand this comment and what you're trying to optimize
here. Plus, what does this case S_IFREG have to do with vfs_mknod()?

> + * all is fine. Let's be bastards - you had / on the end,you've
> + * been asking for (non-existent) directory. -ENOENT for you.
> + */
> + if (name->name[name->len] && !dentry->d_inode) {
> + dput(dentry);
> + dentry = ERR_PTR(-ENOENT);
> + goto out_unlock;
> + }
> +
> + res = vfs_create(parent->d_inode, dentry, mode, nd);
> + if (res) {
> + dput(dentry);
> + dentry = ERR_PTR(res);
> + goto out_unlock;
> + }
> + break;
> + case S_IFDIR:
> + res = vfs_mkdir(parent->d_inode, dentry, mode);
> + if (res) {
> + dput(dentry);
> + dentry = ERR_PTR(res);
> + goto out_unlock;
> + }
> +
> + res = append_to_union(nd->path.mnt, dentry, path->mnt,
> + path->dentry);
> + if (res) {
> + dput(dentry);
> + dentry = ERR_PTR(res);
> + goto out_unlock;
> + }
> + break;
> + default:

So, you're not handling anything other than REG/DIR objects? If so,
document this as a limitation here and in union-mounts.txt.

> + dput(dentry);
> + dentry = ERR_PTR(-EINVAL);
> + goto out_unlock;
> + }
> +
> + out_unlock:
> + mutex_unlock(&parent->d_inode->i_mutex);
> + mnt_drop_write(nd->path.mnt);
> + return dentry;
> +}
> +
> +static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
> + struct dentry *new_dentry, struct vfsmount *new_mnt)
> +{
> + int ret;
> + size_t size;
> + loff_t offset;
> + struct file *old_file, *new_file;
> + const struct cred *cred = current_cred();
> +
> + dget(old_dentry);
> + mntget(old_mnt);
> + old_file = dentry_open(old_dentry, old_mnt, O_RDONLY, cred);
> + if (IS_ERR(old_file))
> + return PTR_ERR(old_file);
> +
> + dget(new_dentry);
> + mntget(new_mnt);
> + new_file = dentry_open(new_dentry, new_mnt, O_WRONLY, cred);
> + ret = PTR_ERR(new_file);
> + if (IS_ERR(new_file))
> + goto fput_old;
> +
> + size = i_size_read(old_file->f_path.dentry->d_inode);
> + if (((size_t)size != size) || ((ssize_t)size != size)) {
> + ret = -EFBIG;
> + goto fput_new;
> + }
> +
> + offset = 0;
> + ret = do_splice_direct(old_file, &offset, new_file, size,
> + SPLICE_F_MOVE);
> + if (ret >= 0)
> + ret = 0;

Is there any chance that do_splice_direct would perform a partial copy of
the file and still return "ret>0"? If so, you're masking out a
partial-copyup "error" condition.
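
One way to avoid masking a short copy is to loop until the expected byte
count has moved and to treat an early end of data as an error. A userspace
sketch of that loop follows; fake_transfer() is a made-up stand-in for a
splice-like call that may move fewer bytes than requested, and returning -EIO
for a short copy is an assumption, not something the patch specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in source for a splice-like call. */
struct fake_src {
	size_t remaining;	/* bytes still available in the source */
	size_t max_per_call;	/* largest transfer a single call performs */
};

/* Moves at most 'max_per_call' bytes; returns the number moved,
 * or 0 once the source is exhausted. */
ssize_t fake_transfer(struct fake_src *src, size_t len)
{
	size_t n = len;

	if (n > src->max_per_call)
		n = src->max_per_call;
	if (n > src->remaining)
		n = src->remaining;
	src->remaining -= n;
	return (ssize_t)n;
}

/* Keep going until 'size' bytes have moved; a short total is an error. */
int copy_exact(struct fake_src *src, size_t size)
{
	size_t done = 0;

	while (done < size) {
		ssize_t n = fake_transfer(src, size - done);

		if (n < 0)
			return (int)n;
		if (n == 0)
			return -EIO;	/* source ended early: partial copy */
		done += (size_t)n;
	}
	assert(done == size);
	return 0;
}

/* Demo wrapper: copy 'size' bytes from a source holding 'avail' bytes,
 * at most 'chunk' bytes per call. */
int copy_demo(size_t avail, size_t chunk, size_t size)
{
	struct fake_src src = { .remaining = avail, .max_per_call = chunk };

	return copy_exact(&src, size);
}
```

The point is only that "ret >= 0" is not the same as "all of 'size' was
copied"; the caller has to compare the byte count.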

> + fput_new:
> + fput(new_file);
> + fput_old:
> + fput(old_file);
> + return ret;
> +}
> +
> +/**
> + * __union_copyup - copy a file to the topmost directory
> + * @old: pointer to path of the old file name
> + * @new_nd: pointer to nameidata of the topmost directory
> + * @new: pointer to path of the new file name
> + *
> + * The topmost directory @new_nd must already be locked. Creates the topmost
> + * file if it doesn't exist yet.
> + */
> +int __union_copyup(struct path *old, struct nameidata *new_nd, struct path *new)
> +{
> + struct dentry *dentry;
> + int error;
> +
> + /* Maybe this should be -EINVAL */
> + if (S_ISDIR(old->dentry->d_inode->i_mode))
> + return -EISDIR;
> +
> + if (new_nd->path.dentry != new->dentry->d_parent) {
> + mutex_lock(&new_nd->path.dentry->d_inode->i_mutex);
> + dentry = lookup_one_len(new->dentry->d_name.name,
> + new_nd->path.dentry,
> + new->dentry->d_name.len);
> + mutex_unlock(&new_nd->path.dentry->d_inode->i_mutex);
> + if (IS_ERR(dentry))
> + return PTR_ERR(dentry);
> + error = -EEXIST;
> + if (dentry->d_inode)
> + goto out_dput;
> + } else
> + dentry = dget(new->dentry);
> +
> + error = mnt_want_write(new_nd->path.mnt);
> + if (error)
> + goto out_dput;
> +
> + if (!dentry->d_inode) {
> + error = vfs_create(new_nd->path.dentry->d_inode, dentry,
> + old->dentry->d_inode->i_mode, new_nd);
> + if (error)
> + goto out_drop_write;
> + }
> +
> + BUG_ON(!S_ISREG(old->dentry->d_inode->i_mode));
> + error = union_copy_file(old->dentry, old->mnt, dentry,
> + new_nd->path.mnt);
> + if (error) {
> + /* FIXME: are there return value we should not
> + * BUG() on ? */
> + BUG_ON(vfs_unlink(new_nd->path.dentry->d_inode,
> + dentry));

I think a BUG_ON is too severe an action to take at this early stage in UM's
development. I'd opt for printk(KERN_WARNING) or a WARN_ON instead. The worst
thing that could happen for now is some cruft left over in the f/s, which
might be helpful for you to figure out why vfs_unlink failed (copyups are
sensitive to EACCES/EPERM/EDQUOT issues esp. when selinux and friends are
enabled).
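
A userspace sketch of the warn-and-continue behaviour, to contrast with
BUG_ON(): the failed unlink is reported, the original copy error is still
returned, and nothing panics. warn_on() and cleanup_failed_copyup() are
hypothetical names for illustration; the kernel's WARN_ON() is a macro that
also dumps a stack trace.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Userspace stand-in for WARN_ON(): complain on stderr and keep running.
 * Returns 1 if the condition fired, 0 otherwise. */
int warn_on(int cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond != 0;
}

/* Hypothetical cleanup path after a failed copy: 'unlink_err' is the result
 * of trying to remove the half-written file.  A failed unlink is reported,
 * and the original copy error is returned either way. */
int cleanup_failed_copyup(int copy_err, int unlink_err)
{
	assert(copy_err < 0);
	(void)warn_on(unlink_err != 0, "unlink of partial copy failed");
	return copy_err;
}
```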

> + goto out_drop_write;
> + }
> +
> + mnt_drop_write(new_nd->path.mnt);
> + dput(new->dentry);
> + new->dentry = dentry;
> + if (new->mnt != new_nd->path.mnt)
> + mntput(new->mnt);
> + new->mnt = new_nd->path.mnt;
> + return error;
> +
> +out_drop_write:
> + mnt_drop_write(new_nd->path.mnt);
> +out_dput:
> + dput(dentry);
> + return error;
> +}
> +
> +/*
> + * union_copyup - copy a file to the topmost layer of the union stack
> + * @nd: nameidata pointer to the file
> + * @flags: flags given to open_namei
> + */
> +int union_copyup(struct nameidata *nd, int flags)
> +{
> + struct qstr this;
> + char *name;
> + struct dentry *dir;
> + struct path path;
> + int err;
> +
> + if (!is_unionized(nd->path.dentry, nd->path.mnt))
> + return 0;
> + if (!S_ISREG(nd->path.dentry->d_inode->i_mode))
> + return 0;
> +
> + /* safe the name for hash_lookup_union() */
> + this.len = nd->path.dentry->d_name.len;
> + this.hash = nd->path.dentry->d_name.hash;
> + name = kmalloc(this.len + 1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> + this.name = name;
> + memcpy(name, nd->path.dentry->d_name.name, nd->path.dentry->d_name.len);
> + name[this.len] = 0;
> +
> + err = union_relookup_topmost(nd, nd->flags|LOOKUP_PARENT);
> + if (err) {
> + kfree(name);
> + return err;
> + }
> + nd->flags &= ~LOOKUP_PARENT;
> +
> + dir = nd->path.dentry;
> + mutex_lock(&dir->d_inode->i_mutex);
> + err = hash_lookup_union(nd, &this, &path);
> + mutex_unlock(&dir->d_inode->i_mutex);
> + kfree(name);
> + if (err)
> + return err;
> +
> + err = -ENOENT;
> + if (!path.dentry->d_inode)
> + goto exit_dput;
> +
> + /* Necessary?! I guess not ... */
> + follow_mount(&path);
> +
> + err = -ENOENT;
> + if (!path.dentry->d_inode)
> + goto exit_dput;
> +
> + err = -EISDIR;
> + if (!S_ISREG(path.dentry->d_inode->i_mode))
> + goto exit_dput;
> +
> + if (path.dentry->d_parent != nd->path.dentry) {
> + err = __union_copyup(&path, nd, &path);
> + if (err)
> + goto exit_dput;
> + }
> +
> + dput(nd->path.dentry);
> + if (nd->path.mnt != path.mnt)
> + mntput(nd->path.mnt);
> + nd->path = path;
> + return 0;
> +
> +exit_dput:
> + dput(path.dentry);
> + if (path.mnt != nd->path.mnt)
> + mntput(path.mnt);
> + return err;
> +}
> +
> +/*
> * This must be called when unhashing a dentry. This is called with dcache_lock
> * and unhashes all unions this dentry is in.
> */
> diff --git a/include/linux/union.h b/include/linux/union.h
> index 0b6f356..405baa9 100644
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -53,6 +53,10 @@ extern void __shrink_d_unions(struct dentry *, struct list_head *);
> extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
> struct dentry *);
> extern void detach_mnt_union(struct vfsmount *);
> +extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
> + struct path *);
> +extern int __union_copyup(struct path *, struct nameidata *, struct path *);

Hmm, I'm curious why the internal __union_copyup helper needs to be
extern'ed? Who else uses it?

> +extern int union_copyup(struct nameidata *, int);
>
> #else /* CONFIG_UNION_MOUNT */
>
> @@ -67,6 +71,9 @@ extern void detach_mnt_union(struct vfsmount *);
> #define __shrink_d_unions(x,y) do { } while (0)
> #define attach_mnt_union(x, y, z) do { } while (0)
> #define detach_mnt_union(x) do { } while (0)
> +#define union_create_topmost(x, y, z) ({ BUG(); (NULL); })
> +#define __union_copyup(x, y, z) ({ BUG(); (0); })
> +#define union_copyup(x, y) ({ (0); })
>
> #endif /* CONFIG_UNION_MOUNT */
> #endif /* __KERNEL__ */
> --
> 1.6.3.3

Erez.

2009-12-01 04:14:44

by Erez Zadok

Subject: Re: [PATCH 29/41] union-mount: Always create topmost directory on open

In message <[email protected]>, Valerie Aurora writes:
> When we open a directory, always create a matching directory on the
> top-level. This way we don't have to go back and create all the
> directories on the path to an element when we want to copy it up.
>
> XXX - Turn into #ifdef'able function
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 34 ++++++++++++++++++++++++++++++----
> 1 files changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1f2a214..8d95eb1 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1284,8 +1284,31 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
> if (err)
> break;
>
> - if ((nd->flags & LOOKUP_TOPMOST) &&
> - (nd->um_flags & LAST_LOWLEVEL)) {
> + /*
> + * We want to create this element on the top level
> + * file system in two cases:
> + *
> + * - We are specifically told to - LOOKUP_TOPMOST.
> + * - This is a directory, and it does not yet exist on
> + * the top level. Various tricks only work if
> + * directories always exist on the top level.
> + *
> + * In either case, only create this element on the top
> + * level if the last element is located on the lower
> + * level. If the last element is located on the top
> + * level, then every single element in the path
> + * already exists on the top level.
> + *
> + * Note that we can assume that the parent is on the
> + * top level since we always create the directory on
> + * the top level.
> + */

OK, yes: a number of things (not "tricks" as you call them) become easier if
you always copyup directories upon path traversal. It's esp. nice wrt
locking semantics to know that the parent dir must always exist.

But, what you're trading off is that you'll be consuming many inodes and
directories on the topmost layer; worse, this policy turns an innocent
readonly "find . -print" into a massive meta-data write operation. If
that's an acceptable compromise, then fine: but you should document this
carefully under a section named "limitations" in your design doc.
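
To put a number on the trade-off, here is a toy model of the policy: every
directory a lookup walks through gets created on the top layer if it is not
there yet. The struct toy_dir layout and both functions are invented for this
illustration; the "mkdir" is just a flag flip standing in for vfs_mkdir().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy directory: lives on the lower layer until a lookup walks through it. */
struct toy_dir {
	int parent;	/* index of the parent, -1 for the root */
	bool on_top;	/* already created on the topmost layer? */
};

/* Walk to dirs[idx], creating missing directories on the top layer along
 * the way; returns how many directories had to be created. */
int toy_lookup(struct toy_dir *dirs, int idx)
{
	int created = 0;

	assert(idx >= 0);
	if (dirs[idx].parent >= 0)
		created += toy_lookup(dirs, dirs[idx].parent);
	if (!dirs[idx].on_top) {
		dirs[idx].on_top = true;	/* stand-in for vfs_mkdir() */
		created++;
	}
	return created;
}

/* A read-only "find ." over root/a/b/c plus a sibling root/d. */
int toy_find_demo(void)
{
	struct toy_dir dirs[] = {
		{ -1, true  },	/* root of the union, always on top */
		{  0, false },	/* a */
		{  1, false },	/* a/b */
		{  2, false },	/* a/b/c */
		{  0, false },	/* d */
	};
	int created = 0;

	for (size_t i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
		created += toy_lookup(dirs, (int)i);
	return created;
}
```

In this model the walk creates one top-layer directory per lower-only
directory it visits, so the write cost of a read-only traversal grows
linearly with the size of the lower tree.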

> + if ((nd->um_flags & LAST_LOWLEVEL) &&
> + ((next.dentry->d_inode &&
> + S_ISDIR(next.dentry->d_inode->i_mode) &&
> + (nd->path.mnt != next.mnt)) ||
> + (nd->flags & LOOKUP_TOPMOST))) {
> struct dentry *dentry;
>
> dentry = union_create_topmost(nd, &this, &next);
> @@ -1349,8 +1372,11 @@ last_component:
> if (err)
> break;
>
> - if ((nd->flags & LOOKUP_TOPMOST) &&
> - (nd->um_flags & LAST_LOWLEVEL)) {
> + if ((nd->um_flags & LAST_LOWLEVEL) &&
> + ((next.dentry->d_inode &&
> + S_ISDIR(next.dentry->d_inode->i_mode) &&
> + (nd->path.mnt != next.mnt)) ||
> + (nd->flags & LOOKUP_TOPMOST))) {
> struct dentry *dentry;
>
> dentry = union_create_topmost(nd, &this, &next);
> --
> 1.6.3.3

Erez.

2009-12-01 04:15:17

by Erez Zadok

Subject: Re: [PATCH 30/41] fallthru: Basic fallthru definitions

BTW, your patch set has an awkward order: you have some vfs and ext2/tmpfs
patches, then UM patches, then fallthru patches for specific file systems,
then UM patches again. Is it possible to order them so all UM patches go in
order; all lower-level F/S patches are also sequential; etc.?

In message <[email protected]>, Valerie Aurora writes:
> Define the fallthru dcache flag and file system op.
>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> include/linux/dcache.h | 6 ++++++
> include/linux/fs.h | 1 +
> 2 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 730c432..a55f79f 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -193,6 +193,7 @@ d_iput: no no no yes
>
> #define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
> #define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */
> +#define DCACHE_FALLTHRU 0x0100 /* Keep looking in the file system below */
>
> #define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */
>
> @@ -381,6 +382,11 @@ static inline int d_is_whiteout(struct dentry *dentry)
> return (dentry->d_flags & DCACHE_WHITEOUT);
> }
>
> +static inline int d_is_fallthru(struct dentry *dentry)
> +{
> + return (dentry->d_flags & DCACHE_FALLTHRU);
> +}
> +
> static inline struct dentry *dget_parent(struct dentry *dentry)
> {
> struct dentry *ret;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index efea78c..57690ab 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1532,6 +1532,7 @@ struct inode_operations {
> int (*rmdir) (struct inode *,struct dentry *);
> int (*mknod) (struct inode *,struct dentry *,int,dev_t);
> int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
> + int (*fallthru) (struct inode *, struct dentry *);

Same nit I had before: why add ->fallthru in the middle of inode_operations
and not at the end?

> int (*rename) (struct inode *, struct dentry *,
> struct inode *, struct dentry *);
> int (*readlink) (struct dentry *, char __user *,int);
> --
> 1.6.3.3

Erez.

2009-12-01 04:15:51

by Erez Zadok

Subject: Re: [PATCH 31/41] fallthru: Support for fallthru entries in union mount lookup

In message <[email protected]>, Valerie Aurora writes:
> A fallthru directory entry overrides the opaque flag for its parent
> directory (for this directory entry only). Before, we stopped
> building the union stack when we encountered an opaque directory; now
> we include directories below opaque directories in the union stack and
> check for opacity during lookup.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 7 +++----
> fs/namei.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------------
> 2 files changed, 48 insertions(+), 18 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index d80a3bb..ca8a661 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1086,7 +1086,7 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
> static void __d_instantiate(struct dentry *dentry, struct inode *inode)
> {
> if (inode) {
> - dentry->d_flags &= ~DCACHE_WHITEOUT;
> + dentry->d_flags &= ~(DCACHE_WHITEOUT|DCACHE_FALLTHRU);
> list_add(&dentry->d_alias, &inode->i_dentry);
> }
> dentry->d_inode = inode;
> @@ -1638,9 +1638,8 @@ void d_delete(struct dentry * dentry)
>
> static void __d_rehash(struct dentry * entry, struct hlist_head *list)
> {
> -
> - entry->d_flags &= ~DCACHE_UNHASHED;
> - hlist_add_head_rcu(&entry->d_hash, list);
> + entry->d_flags &= ~DCACHE_UNHASHED;
> + hlist_add_head_rcu(&entry->d_hash, list);
> }
>
> static void _d_rehash(struct dentry * entry)
> diff --git a/fs/namei.c b/fs/namei.c
> index 8d95eb1..61e94aa 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -416,6 +416,28 @@ static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
> return dentry;
> }
>
> +/*
> + * Theory of operation for opaque, whiteout, and fallthru:
> + *
> + * whiteout: Unconditionally stop lookup here - ENOENT
> + *
> + * opaque: Don't lookup in directories lower in the union stack
> + *
> + * fallthru: While looking up an entry, ignore the opaque flag for the
> + * current directory only.
> + *
> + * A union stack is a linked list of directory dentries which appear
> + * in the same place in the namespace. When constructing the union
> + * stack, we include directories below opaque directories so that we
> + * can properly handle fallthrus. All non-fallthru lookups have to
> + * check for the opaque flag on the parent directory and obey it.
> + *
> + * In general, the code pattern is to lookup the the topmost entry
> + * first (either the first visible non-negative dentry or a negative
> + * dentry in the topmost layer of the union), then build the union
> + * stack for the newly looked-up entry (if it is a directory).
> + */

Excellent comment; even better said and clearer than the UM design doc.
It's almost a pity such an excellent description is buried in the middle of
namei.c.

> +
> /**
> * __cache_lookup_topmost - lookup the topmost (non-)negative dentry
> *
> @@ -445,6 +467,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (!dentry || (dentry->d_inode || d_is_whiteout(dentry)))
> return !dentry;
>
> + /* Keep going through opaque directories if we found a fallthru */
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(dentry))
> + return !dentry;
> +
> /* look for the first non-negative or whiteout dentry */
>
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> @@ -470,6 +496,10 @@ static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (dentry->d_inode || d_is_whiteout(dentry))
> goto out_dput;
>
> + /* Stop the lookup on opaque parent and non-fallthru child */
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(dentry))

There's a lot of repetition of this condition in this patch (opaque parent
directory, non-fallthru child dentry):

IS_OPAQUE(parent->d_inode) && !d_is_fallthru(child)

Perhaps this should be macro'ized?

#define CANT_GO_ON(parent, child) (IS_OPAQUE((parent)->d_inode) && !d_is_fallthru(child)) // :-)
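
A static inline function may read better than a macro here, since the check involves two different dentries: the parent directory (opacity) and the looked-up child (fallthru). Below is a rough userspace sketch under stated assumptions: the struct definitions are minimal stand-ins for the kernel types, the `opaque` field models what IS_OPAQUE() tests, and the helper name `union_lookup_stops` is hypothetical. Only the DCACHE_FALLTHRU value comes from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define DCACHE_FALLTHRU 0x0100  /* value as defined in patch 30/41 */

/* Minimal stand-ins for the kernel structures (assumption for this sketch). */
struct inode  { bool opaque; };
struct dentry { unsigned int d_flags; struct inode *d_inode; };

static inline bool d_is_fallthru(const struct dentry *dentry)
{
	return (dentry->d_flags & DCACHE_FALLTHRU) != 0;
}

/*
 * Hypothetical helper: lookup stops at this layer when the parent
 * directory is opaque and the child is not a fallthru entry.
 */
static inline bool union_lookup_stops(const struct dentry *parent,
				      const struct dentry *child)
{
	return parent->d_inode->opaque && !d_is_fallthru(child);
}
```

Each call site in the patch would then pass nd->path.dentry as the parent and the freshly looked-up dentry as the child.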

> + goto out_dput;
> +
> dput(dentry);
> }
>
> @@ -528,9 +558,6 @@ static int __cache_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = dentry;
> last.mnt = mntget(nd->path.mnt);
> -
> - if (IS_OPAQUE(last.dentry->d_inode))
> - break;
> }
>
> if (last.dentry != path->dentry)
> @@ -570,8 +597,7 @@ static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
>
> /* only directories can be part of a union stack */
> if (!path->dentry->d_inode ||
> - !S_ISDIR(path->dentry->d_inode->i_mode) ||
> - IS_OPAQUE(path->dentry->d_inode))
> + !S_ISDIR(path->dentry->d_inode->i_mode))
> goto out;
>
> /* Build the union stack for this part */
> @@ -738,6 +764,9 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (path->dentry->d_inode || d_is_whiteout(path->dentry))
> return 0;
>
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(path->dentry))
> + return 0;
> +
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> name->hash = full_name_hash(name->name, name->len);
> if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> @@ -758,6 +787,9 @@ static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
> goto out;
> }
>
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(next.dentry))
> + goto out;
> +
> dput(next.dentry);
> }
> out:
> @@ -817,9 +849,6 @@ static int __real_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = next.dentry;
> last.mnt = mntget(next.mnt);
> -
> - if (IS_OPAQUE(last.dentry->d_inode))
> - break;
> }
>
> if (last.dentry != path->dentry)
> @@ -841,8 +870,7 @@ static int real_lookup_union(struct nameidata *nd, struct qstr *name,
>
> /* only directories can be part of a union stack */
> if (!path->dentry->d_inode ||
> - !S_ISDIR(path->dentry->d_inode->i_mode) ||
> - IS_OPAQUE(path->dentry->d_inode))
> + !S_ISDIR(path->dentry->d_inode->i_mode))
> goto out;
>
> /* Build the union stack for this part */
> @@ -1128,7 +1156,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> {
> int err;
>
> - if (IS_MNT_UNION(nd->path.mnt) && !IS_OPAQUE(nd->path.dentry->d_inode))
> + if (IS_MNT_UNION(nd->path.mnt))
> goto need_union_lookup;
>
> path->dentry = __d_lookup(nd->path.dentry, name);
> @@ -1679,6 +1707,9 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
> if (path->dentry->d_inode || d_is_whiteout(path->dentry))
> return 0;
>
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(path->dentry))
> + return 0;
> +
> while (follow_union_down(&nd->path.mnt, &nd->path.dentry)) {
> name->hash = full_name_hash(name->name, name->len);
> if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
> @@ -1701,6 +1732,9 @@ static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
> goto out;
> }
>
> + if (IS_OPAQUE(nd->path.dentry->d_inode) && !d_is_fallthru(next.dentry))
> + goto out;
> +
> dput(next.dentry);
> }
> out:
> @@ -1755,9 +1789,6 @@ static int __hash_lookup_build_union(struct nameidata *nd, struct qstr *name,
> path_put(&last);
> last.dentry = next.dentry;
> last.mnt = mntget(next.mnt);
> -
> - if (IS_OPAQUE(last.dentry->d_inode))
> - break;
> }
>
> if (last.dentry != path->dentry)
> --
> 1.6.3.3

Erez.

2009-12-01 04:17:37

by Erez Zadok

Subject: Re: [PATCH 32/41] fallthru: ext2 fallthru support

In message <[email protected]>, Valerie Aurora writes:
> Add support for fallthru directory entries to ext2.
>
> XXX - Makes up inode number for fallthru entry
> XXX - Might be better implemented as special symlinks
>
> Cc: Theodore Tso <[email protected]>
> Cc: [email protected]
> Signed-off-by: Valerie Aurora <[email protected]>
> Signed-off-by: Jan Blunck <[email protected]>
> ---
> fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext2/ext2.h | 1 +
> fs/ext2/namei.c | 20 ++++++++++
> include/linux/ext2_fs.h | 1 +
> 4 files changed, 110 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index d4628c0..2665bc6 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
> {
> if (len != de->name_len)
> return 0;
> - if (!de->inode && (de->file_type != EXT2_FT_WHT))
> + if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
> + (de->file_type != EXT2_FT_FALLTHRU)))

The extra set of parentheses is unnecessary here and in several places below.

> return 0;
> return !memcmp(name, de->name, len);
> }
> @@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
> [EXT2_FT_SOCK] = DT_SOCK,
> [EXT2_FT_SYMLINK] = DT_LNK,
> [EXT2_FT_WHT] = DT_WHT,
> + [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
> };
>
> #define S_SHIFT 12
> @@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
> ext2_put_page(page);
> return 0;
> }
> + } else if (de->file_type == EXT2_FT_FALLTHRU) {
> + int over;
> + unsigned char d_type = DT_UNKNOWN;
> +
> + offset = (char *)de - kaddr;
> + /* XXX We don't know the inode number
> + * of the directory entry in the
> + * underlying file system. Should
> + * look it up, either on fallthru
> + * creation at first readdir or now at
> + * filldir time. */
> + over = filldir(dirent, de->name, de->name_len,
> + (n<<PAGE_CACHE_SHIFT) | offset,
> + 123 /* Made up ino */, d_type);

So, why 123 rather than some otherwise unused number below 10? At least that
way it's in the ext2 "reserved" range should something go horribly wrong
(like a power failure shortly thereafter).

BTW, this yet-unimplemented functionality should be mentioned under
"limitations" or something in the current design doc. I also think the
design doc should list all short-term and long-term things that need to be
implemented, and in what order.

> + if (over) {
> + ext2_put_page(page);
> + return 0;
> + }
> }
> filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
> }
> @@ -463,6 +483,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
> spin_lock(&dentry->d_lock);
> dentry->d_flags |= DCACHE_WHITEOUT;
> spin_unlock(&dentry->d_lock);
> + } else if(!res && de->file_type == EXT2_FT_FALLTHRU) {
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> }
> ext2_put_page(page);
> }
> @@ -532,6 +556,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
> de->name_len = 0;
> de->rec_len = ext2_rec_len_to_disk(chunk_size);
> de->inode = 0;
> + de->file_type = 0;
> goto got_it;
> }
> if (de->rec_len == 0) {
> @@ -545,6 +570,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
> name_len = EXT2_DIR_REC_LEN(de->name_len);
> rec_len = ext2_rec_len_from_disk(de->rec_len);
> if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
> + (de->file_type != EXT2_FT_FALLTHRU) &&
> (rec_len >= reclen))
> goto got_it;
> if (rec_len >= name_len + reclen)
> @@ -587,7 +613,8 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
>
> err = -EEXIST;
> if (ext2_match (namelen, name, de)) {
> - if (de->file_type == EXT2_FT_WHT)
> + if ((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU))
> goto got_it;
> goto out_unlock;
> }
> @@ -602,7 +629,8 @@ got_it:
> &page, NULL);
> if (err)
> goto out_unlock;
> - if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
> + if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) &&
> !ext2_match (namelen, name, de))) {
> ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> @@ -627,6 +655,60 @@ out_unlock:
> }
>
> /*
> + * Create a fallthru entry.
> + */
> +int ext2_fallthru_entry (struct inode *dir, struct dentry *dentry)
> +{
> + const char *name = dentry->d_name.name;
> + int namelen = dentry->d_name.len;
> + unsigned short rec_len, name_len;
> + ext2_dirent * de;
> + struct page *page;
> + loff_t pos;
> + int err;
> +
> + de = ext2_append_entry(dentry, &page);
> + if (IS_ERR(de))
> + return PTR_ERR(de);
> +
> + err = -EEXIST;
> + if (ext2_match (namelen, name, de))
> + goto out_unlock;
> +
> + name_len = EXT2_DIR_REC_LEN(de->name_len);
> + rec_len = ext2_rec_len_from_disk(de->rec_len);
> +
> + pos = page_offset(page) +
> + (char*)de - (char*)page_address(page);
> + err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
> + &page, NULL);
> + if (err)
> + goto out_unlock;
> + if (de->inode || (de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) {
> + ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> + de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> + de->rec_len = ext2_rec_len_to_disk(name_len);
> + de = de1;
> + }
> + de->name_len = namelen;
> + memcpy(de->name, name, namelen);
> + de->inode = 0;
> + de->file_type = EXT2_FT_FALLTHRU;
> + err = ext2_commit_chunk(page, pos, rec_len);
> + dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
> + EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
> + mark_inode_dirty(dir);
> + /* OFFSET_CACHE */
> +out_put:
> + ext2_put_page(page);
> + return err;
> +out_unlock:
> + unlock_page(page);
> + goto out_put;
> +}
> +
> +/*
> * ext2_delete_entry deletes a directory entry by merging it with the
> * previous entry. Page is up-to-date. Releases the page.
> */
> @@ -711,7 +793,9 @@ int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
> */
> if (ext2_match (namelen, name, de))
> de->inode = 0;
> - if (de->inode || (de->file_type == EXT2_FT_WHT)) {
> + if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) &&
> + !ext2_match (namelen, name, de))) {
> ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> de->rec_len = ext2_rec_len_to_disk(name_len);
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index a7f057f..328fc1c 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -108,6 +108,7 @@ extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *,
> extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
> extern int ext2_whiteout_entry (struct inode *, struct dentry *,
> struct ext2_dir_entry_2 *, struct page *);
> +extern int ext2_fallthru_entry (struct inode *, struct dentry *);
> extern int ext2_empty_dir (struct inode *);
> extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
> extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 9c4eef2..2ac44f1 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -333,6 +333,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
> goto out;
>
> spin_lock(&new_dentry->d_lock);
> + new_dentry->d_flags &= ~DCACHE_FALLTHRU;
> new_dentry->d_flags |= DCACHE_WHITEOUT;
> spin_unlock(&new_dentry->d_lock);
> d_add(new_dentry, NULL);
> @@ -351,6 +352,24 @@ out:
> return err;
> }
>
> +/*
> + * Create a fallthru entry.
> + */
> +static int ext2_fallthru (struct inode *dir, struct dentry *dentry)
> +{
> + int err;
> +
> + err = ext2_fallthru_entry(dir, dentry);
> + if (err)
> + return err;
> +
> + d_instantiate(dentry, NULL);
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> + return 0;
> +}
> +
> static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
> struct inode * new_dir, struct dentry * new_dentry )
> {
> @@ -451,6 +470,7 @@ const struct inode_operations ext2_dir_inode_operations = {
> .rmdir = ext2_rmdir,
> .mknod = ext2_mknod,
> .whiteout = ext2_whiteout,
> + .fallthru = ext2_fallthru,
> .rename = ext2_rename,
> #ifdef CONFIG_EXT2_FS_XATTR
> .setxattr = generic_setxattr,
> diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
> index bd10826..f6b68ec 100644
> --- a/include/linux/ext2_fs.h
> +++ b/include/linux/ext2_fs.h
> @@ -577,6 +577,7 @@ enum {
> EXT2_FT_SOCK,
> EXT2_FT_SYMLINK,
> EXT2_FT_WHT,
> + EXT2_FT_FALLTHRU,
> EXT2_FT_MAX
> };
>
> --
> 1.6.3.3

Erez.

2009-12-01 04:17:53

by Erez Zadok

Subject: Re: [PATCH 33/41] fallthru: jffs2 fallthru support

In message <[email protected]>, Valerie Aurora writes:
> From: Felix Fietkau <[email protected]>
>
> Add support for fallthru dentries to jffs2.
>
> Cc: David Woodhouse <[email protected]>
> Cc: [email protected]
> Signed-off-by: Felix Fietkau <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/jffs2/dir.c | 31 ++++++++++++++++++++++++++++++-
> include/linux/jffs2.h | 6 ++++++
> 2 files changed, 36 insertions(+), 1 deletions(-)
>
> diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> index 46a2e1b..544d6c5 100644
> --- a/fs/jffs2/dir.c
> +++ b/fs/jffs2/dir.c
> @@ -35,6 +35,7 @@ static int jffs2_rename (struct inode *, struct dentry *,
> struct inode *, struct dentry *);
>
> static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
> +static int jffs2_fallthru (struct inode *, struct dentry *);
>
> const struct file_operations jffs2_dir_operations =
> {
> @@ -57,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations =
> .rmdir = jffs2_rmdir,
> .mknod = jffs2_mknod,
> .rename = jffs2_rename,
> + .fallthru = jffs2_fallthru,
> .whiteout = jffs2_whiteout,
> .permission = jffs2_permission,
> .setattr = jffs2_setattr,
> @@ -107,6 +109,9 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
> case DT_WHT:
> target->d_flags |= DCACHE_WHITEOUT;
> break;
> + case JFFS2_DT_FALLTHRU:
> + target->d_flags |= DCACHE_FALLTHRU;
> + break;
> default:
> ino = fd->ino;
> break;
> @@ -168,7 +173,10 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
> fd->name, fd->ino, fd->type, curofs, offset));
> continue;
> }
> - if (!fd->ino) {
> + if (fd->type == JFFS2_DT_FALLTHRU)
> + /* XXX Should really do a lookup for the real inode number here */
> + fd->ino = 100;

In the ext2 patch, it was ino=123, here it's 100. Is there a consistently
useful reserved number to use instead, for jffs2 as well? If not, maybe at
least we can pick one arbitrary inode number and use it as the placeholder
in ext2, jffs2, etc.?
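
One way to keep the made-up number consistent is a single shared definition that every file system's readdir path uses. The sketch below is illustrative only: the constant name, its value, and the helper are assumptions, not part of the patch set, and the right value would have to be agreed on per file system.

```c
#include <stdint.h>

/*
 * Hypothetical shared placeholder. The name and the value 9 are
 * assumptions made for this sketch, not taken from the patches.
 */
#define FALLTHRU_PLACEHOLDER_INO 9ULL

/*
 * Inode number to hand to filldir for one directory entry.
 * real_ino is 0 when the entry is a fallthru whose lower-layer
 * inode number has not been looked up.
 */
static inline uint64_t fallthru_report_ino(int is_fallthru, uint64_t real_ino)
{
	if (is_fallthru && real_ino == 0)
		return FALLTHRU_PLACEHOLDER_INO;
	return real_ino;
}
```

Once the real lookup is implemented, only this one helper would need to change.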

> + else if (!fd->ino && (fd->type != DT_WHT)) {
> D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
> offset++;
> continue;
> @@ -797,6 +805,26 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
> return 0;
> }
>
> +static int jffs2_fallthru (struct inode *dir, struct dentry *dentry)
> +{
> + struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
> + uint32_t now;
> + int ret;
> +
> + now = get_seconds();
> + ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_UNKNOWN,
> + dentry->d_name.name, dentry->d_name.len, now);
> + if (ret)
> + return ret;
> +
> + d_instantiate(dentry, NULL);
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> +
> + return 0;
> +}
> +
> static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
> struct dentry *new_dentry)
> {
> @@ -830,6 +858,7 @@ static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
> return ret;
>
> spin_lock(&new_dentry->d_lock);
> + new_dentry->d_flags &= ~DCACHE_FALLTHRU;
> new_dentry->d_flags |= DCACHE_WHITEOUT;
> spin_unlock(&new_dentry->d_lock);
> d_add(new_dentry, NULL);
> diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
> index 65533bb..dbe8c93 100644
> --- a/include/linux/jffs2.h
> +++ b/include/linux/jffs2.h
> @@ -114,6 +114,12 @@ struct jffs2_unknown_node
> jint32_t hdr_crc;
> };
>
> +/*
> + * Non-standard directory entry type(s), for on-disk use
> + */
> +
> +#define JFFS2_DT_FALLTHRU (DT_WHT + 1)
> +
> struct jffs2_raw_dirent
> {
> jint16_t magic;
> --
> 1.6.3.3

Erez.

2009-12-01 04:18:09

by Erez Zadok

Subject: Re: [PATCH 34/41] fallthru: tmpfs fallthru support

In message <[email protected]>, Valerie Aurora writes:
> Add support for fallthru directory entries to tmpfs

Need to CC tmpfs maintainers here.

> XXX - Makes up inode number for dirent
>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/dcache.c | 3 +-
> fs/libfs.c | 21 +++++++++++++++++--
> mm/shmem.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 73 insertions(+), 11 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index ca8a661..8ef2d89 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2292,7 +2292,8 @@ resume:
> struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> next = tmp->next;
> if (d_unhashed(dentry)||(!dentry->d_inode &&
> - !d_is_whiteout(dentry)))
> + !d_is_whiteout(dentry) &&
> + !d_is_fallthru(dentry)))
> continue;
> if (!list_empty(&dentry->d_subdirs)) {
> this_parent = dentry;
> diff --git a/fs/libfs.c b/fs/libfs.c
> index dcec3d3..01f3e73 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -133,6 +133,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
> struct dentry *cursor = filp->private_data;
> struct list_head *p, *q = &cursor->d_u.d_child;
> ino_t ino;
> + int d_type;
> int i = filp->f_pos;
>
> switch (i) {
> @@ -158,14 +159,28 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
> for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
> struct dentry *next;
> next = list_entry(p, struct dentry, d_u.d_child);
> - if (d_unhashed(next) || !next->d_inode)
> + if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
> continue;
>
> + if (d_is_fallthru(next)) {
> + /* XXX We don't know the inode
> + * number of the directory
> + * entry in the underlying
> + * file system. Should look
> + * it up, either on fallthru
> + * creation at first readdir
> + * or now at filldir time. */
> + ino = 123; /* Made up ino */

OK, so here it's 123, as in ext2, but not as in jffs2, which had it set to 100...

> + d_type = DT_UNKNOWN;
> + } else {
> + ino = next->d_inode->i_ino;
> + d_type = dt_type(next->d_inode);
> + }
> +
> spin_unlock(&dcache_lock);
> if (filldir(dirent, next->d_name.name,
> next->d_name.len, filp->f_pos,
> - next->d_inode->i_ino,
> - dt_type(next->d_inode)) < 0)
> + ino, d_type) < 0)
> return 0;
> spin_lock(&dcache_lock);
> /* next is still alive */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2faa14b..4f4b4b6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1798,8 +1798,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
> static int shmem_unlink(struct inode *dir, struct dentry *dentry);
>
> /*
> - * This is the whiteout support for tmpfs. It uses one singleton whiteout
> - * inode per superblock thus it is very similar to shmem_link().
> + * Create a dentry to signify a whiteout.
> */
> static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
> struct dentry *new_dentry)
> @@ -1830,8 +1829,10 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
> spin_unlock(&sbinfo->stat_lock);
> }
>
> - if (old_dentry->d_inode) {
> - if (S_ISDIR(old_dentry->d_inode->i_mode))
> + if (old_dentry->d_inode || d_is_fallthru(old_dentry)) {
> + /* A fallthru for a dir is treated like a regular link */
> + if (old_dentry->d_inode &&
> + S_ISDIR(old_dentry->d_inode->i_mode))
> shmem_rmdir(dir, old_dentry);
> else
> shmem_unlink(dir, old_dentry);
> @@ -1848,6 +1849,48 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
> }
>
> static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
> + struct inode *inode);
> +
> +/*
> + * Create a dentry to signify a fallthru. A fallthru in tmpfs is the
> + * logical equivalent of an in-kernel readdir() cache. It can't be
> + * deleted until the file system is unmounted.
> + */
> +static int shmem_fallthru(struct inode *dir, struct dentry *dentry)
> +{
> + struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
> +
> + /* FIXME: this is stupid */
> + if (!(dir->i_sb->s_flags & MS_WHITEOUT))
> + return -EPERM;
> +
> + if (dentry->d_inode || d_is_fallthru(dentry) || d_is_whiteout(dentry))
> + return -EEXIST;
> +
> + /*
> + * Each new link needs a new dentry, pinning lowmem, and tmpfs
> + * dentries cannot be pruned until they are unlinked.
> + */
> + if (sbinfo->max_inodes) {
> + spin_lock(&sbinfo->stat_lock);
> + if (!sbinfo->free_inodes) {
> + spin_unlock(&sbinfo->stat_lock);
> + return -ENOSPC;
> + }
> + sbinfo->free_inodes--;
> + spin_unlock(&sbinfo->stat_lock);
> + }
> +
> + shmem_d_instantiate(dir, dentry, NULL);
> + dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> +
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> + return 0;
> +}
> +
> +static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
> struct inode *inode)
> {
> if (d_is_whiteout(dentry)) {
> @@ -1855,14 +1898,15 @@ static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
> shmem_free_inode(dir->i_sb);
> if (S_ISDIR(inode->i_mode))
> inode->i_mode |= S_OPAQUE;
> + } else if (d_is_fallthru(dentry)) {
> + shmem_free_inode(dir->i_sb);
> } else {
> /* New dentry */
> dir->i_size += BOGO_DIRENT_SIZE;
> dget(dentry); /* Extra count - pin the dentry in core */
> }
> - /* Will clear DCACHE_WHITEOUT flag */
> + /* Will clear DCACHE_WHITEOUT and DCACHE_FALLTHRU flags */
> d_instantiate(dentry, inode);
> -
> }
> /*
> * File creation. Allocate an inode, and we're done..
> @@ -1947,7 +1991,8 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
> {
> struct inode *inode = dentry->d_inode;
>
> - if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
> + if (d_is_whiteout(dentry) || d_is_fallthru(dentry) ||
> + (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
> shmem_free_inode(dir->i_sb);

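I'd reorder this || condition so that the sub-condition most likely to be
true, (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)), shows up first and the
d_is_whatever checks go last. Note that this needs a NULL check on inode:
whiteout and fallthru dentries are negative, and right now the d_is_* tests
are what keep inode from being dereferenced in that case.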
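
A userspace sketch of the reordered test, with the NULL guard that the reordering requires. The structs are minimal stand-ins for the kernel types, `is_dir` models S_ISDIR(inode->i_mode), and the helper name is hypothetical; the flag values are the ones from the patches.

```c
#include <stdbool.h>
#include <stddef.h>

#define DCACHE_WHITEOUT 0x0080
#define DCACHE_FALLTHRU 0x0100

/* Minimal stand-ins for the kernel structures (assumption for this sketch). */
struct inode  { unsigned int i_nlink; bool is_dir; };
struct dentry { unsigned int d_flags; struct inode *d_inode; };

/*
 * Hypothetical helper: should unlink give back an inode slot?
 * The common case (extra hard link to a non-directory) is tested
 * first, guarded by a NULL check because whiteout and fallthru
 * dentries carry no inode.
 */
static bool unlink_frees_inode_slot(const struct dentry *dentry)
{
	const struct inode *inode = dentry->d_inode;

	if (inode && inode->i_nlink > 1 && !inode->is_dir)
		return true;
	return (dentry->d_flags & (DCACHE_WHITEOUT | DCACHE_FALLTHRU)) != 0;
}
```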

>
> if (inode) {
> @@ -2583,6 +2628,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
> .mknod = shmem_mknod,
> .rename = shmem_rename,
> .whiteout = shmem_whiteout,
> + .fallthru = shmem_fallthru,
> #endif
> #ifdef CONFIG_TMPFS_POSIX_ACL
> .setattr = shmem_notify_change,
> --
> 1.6.3.3

Erez.

2009-12-01 04:19:17

by Erez Zadok

Subject: Re: [PATCH 35/41] union-mount: Copy up directory entries on first readdir()

In message <[email protected]>, Valerie Aurora writes:
> readdir() in union mounts is implemented by copying up all visible
> directory entries from the lower level directories to the topmost
> directory. Directory entries that refer to lower level file system
> objects are marked as "fallthru" in the topmost directory.
>
> Thanks to Felix Fietkau <[email protected]> for a bug fix.
>
> XXX - Do we need i_mutex on lower layer?
> XXX - Rewrite for two layers only?
>
> Signed-off-by: Valerie Aurora <[email protected]>
> Signed-off-by: Felix Fietkau <[email protected]>
> ---
> fs/readdir.c | 17 +++++
> fs/union.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/union.h | 2 +
> 3 files changed, 190 insertions(+), 0 deletions(-)
>
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 3a48491..cfeacd8 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -16,6 +16,8 @@
> #include <linux/security.h>
> #include <linux/syscalls.h>
> #include <linux/unistd.h>
> +#include <linux/union.h>
> +#include <linux/mount.h>
>
> #include <asm/uaccess.h>
>
> @@ -36,9 +38,24 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
>
> res = -ENOENT;
> if (!IS_DEADDIR(inode)) {
> + /*
> + * XXX Think harder about locking for
> + * union_copyup_dir. Currently we lock the topmost

This goes back to the issue of needing all lower layers to be
really-really readonly.

> + * directory and hold that lock while sequentially
> + * acquiring and dropping locks for the directories
> + * below this one in the union stack.
> + */
> + if (is_unionized(file->f_path.dentry, file->f_path.mnt) &&
> + !IS_OPAQUE(inode) && IS_MNT_UNION(file->f_path.mnt)) {
> + res = union_copyup_dir(&file->f_path);
> + if (res)
> + goto out_unlock;
> + }
> +
> res = file->f_op->readdir(file, buf, filler);
> file_accessed(file);
> }
> +out_unlock:
> mutex_unlock(&inode->i_mutex);
> out:
> return res;
> diff --git a/fs/union.c b/fs/union.c
> index de31fc9..d56b829 100644
> --- a/fs/union.c
> +++ b/fs/union.c
> @@ -5,6 +5,7 @@
> * Copyright (C) 2007-2009 Novell Inc.
> *
> * Author(s): Jan Blunck ([email protected])
> + * Valerie Aurora <[email protected]>

Hmm, maybe Red Hat wants a Copyright mention as well?

> *
> * This program is free software; you can redistribute it and/or modify it
> * under the terms of the GNU General Public License as published by the Free
> @@ -777,3 +778,173 @@ void detach_mnt_union(struct vfsmount *mnt)
> union_put(um);
> return;
> }
> +
> +/**
> + * union_copyup_dir_one - copy up a single directory entry
> + *
> + * Individual directory entry copyup function for union_copyup_dir.
> + * We get the entries from higher level layers first.
> + */
> +
> +static int union_copyup_dir_one(void *buf, const char *name, int namlen,
> + loff_t offset, u64 ino, unsigned int d_type)
> +{
> + struct dentry *topmost_dentry = (struct dentry *) buf;
> + struct dentry *dentry;
> + int err = 0;
> +
> + switch (namlen) {
> + case 2:
> + if (name[1] != '.')
> + break;
> + case 1:
> + if (name[0] != '.')
> + break;
> + return 0;
> + }
> +
> + /* Lookup this entry in the topmost directory */
> + dentry = lookup_one_len(name, topmost_dentry, namlen);
> +
> + if (IS_ERR(dentry)) {
> + printk(KERN_INFO "error looking up %s\n", dentry->d_name.name);
> + goto out;
> + }
> +
> + /*
> + * If the entry already exists, one of the following is true:
> + * it was already copied up (due to an earlier lookup), an
> + * entry with the same name already exists on the topmost file
> + * system, it is a whiteout, or it is a fallthru. In each
> + * case, the top level entry masks any entries from lower file
> + * systems, so don't copy up this entry.
> + */
> + if (dentry->d_inode || d_is_whiteout(dentry) ||
> + d_is_fallthru(dentry)) {
> + printk(KERN_INFO "skipping copy of %s\n", dentry->d_name.name);

Do we really need this printk here? Is it more of a KERN_DEBUG printk or
really just an _INFO? Either way, I suggest all UM printk's be prefixed by
something like "um: " so it's easy to grep for them in system/console logs.
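
To illustrate the prefix idea, here is a minimal userspace sketch, not kernel code. The helper um_format() and the UM_PREFIX macro are hypothetical names made up for this example; in the kernel I believe the usual route is a pr_fmt define placed ahead of the includes, but treat that as an assumption to check against the tree you are working on.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical prefix so every union-mount message is easy to grep for. */
#define UM_PREFIX "um: "

/*
 * Format a message into buf with UM_PREFIX in front of it.
 * Returns the number of characters that the full message needs
 * (snprintf semantics), or a negative value on a formatting error.
 */
static int um_format(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n, m;

	assert(buf != NULL && len > 0);

	n = snprintf(buf, len, "%s", UM_PREFIX);
	if (n < 0 || (size_t)n >= len)
		return n;	/* error, or no room left after the prefix */

	va_start(ap, fmt);
	m = vsnprintf(buf + n, len - (size_t)n, fmt, ap);
	va_end(ap);

	return m < 0 ? m : n + m;
}
```

With a helper like this, "grep 'um: '" over the console log finds every message from the union mount code.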

> + goto out_dput;
> + }
> +
> + /*
> + * If the entry doesn't exist, create a fallthru entry in the
> + * topmost file system. All possible directory types are
> + * used, so each file system must implement its own way of
> + * storing a fallthru entry.
> + */
> + printk(KERN_INFO "creating fallthru for %s\n", dentry->d_name.name);
> + err = topmost_dentry->d_inode->i_op->fallthru(topmost_dentry->d_inode,
> + dentry);
> + /* FIXME */
> + BUG_ON(err);

BUG_ON is too extreme here. Just return an error to the caller and be sure
it gets handled properly there.
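
A minimal userspace sketch of the error-propagation pattern suggested here, under stated assumptions: copyup_one(), copyup_all(), and the two fallthru_* test doubles are hypothetical stand-ins, not functions from the patch. Note that the quoted function also ends in an unconditional "return 0" at its out: label, so returning err there would presumably be part of the same change.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the ->fallthru() inode operation. */
typedef int (*fallthru_fn)(const char *name);

/* Per-entry step: hand the failure to the caller instead of BUG_ON(err). */
static int copyup_one(const char *name, fallthru_fn create_fallthru)
{
	int err;

	assert(name != NULL);

	err = create_fallthru(name);
	if (err)
		return err;	/* propagate, do not crash */
	return 0;
}

/*
 * Caller: stop at the first failure and pass the error further up.
 * *done is set to the number of entries processed successfully.
 */
static int copyup_all(const char *const *names, size_t n,
		      fallthru_fn create_fallthru, size_t *done)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = copyup_one(names[i], create_fallthru);
		if (err)
			break;
	}
	*done = i;
	return err;
}

/* Hypothetical test doubles: one always succeeds, one fails on "b..." names. */
static int fallthru_ok(const char *name)
{
	(void)name;
	return 0;
}

static int fallthru_fail_on_b(const char *name)
{
	return (name[0] == 'b') ? -ENOSPC : 0;
}
```

The point of the design is that the decision about what to do on failure (retry, undo the opaque flag, return the error to readdir()) moves to the caller, which has the context to make it.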

> + /*
> + * At this point, we have a negative dentry marked as fallthru
> + * in the cache. We could potentially lookup the entry lower
> + * level file system and turn this into a positive dentry
> + * right now, but it is not clear that would be a performance
> + * win and adds more opportunities to fail.
> + */
> +out_dput:
> + dput(dentry);
> +out:
> + return 0;
> +}
> +
> +/**
> + * union_copyup_dir - copy up low-level directory entries to topmost dir
> + *
> + * readdir() is difficult to support on union file systems for two
> + * reasons: We must eliminate duplicates and apply whiteouts, and we
> + * must return something in f_pos that lets us restart in the same
> + * place when we return. Our solution is to, on first readdir() of
> + * the directory, copy up all visible entries from the low-level file
> + * systems and mark the entries that refer to low-level file system
> + * objects as "fallthru" entries.
> + */
> +
> +int union_copyup_dir(struct path *topmost_path)
> +{
> + struct dentry *topmost_dentry = topmost_path->dentry;
> + struct path path = *topmost_path;
> + int res = 0;
> +
> + /*
> + * Skip opaque dirs.
> + */
> + if (IS_OPAQUE(topmost_dentry->d_inode))
> + return 0;
> +
> + res = mnt_want_write(topmost_path->mnt);
> + if (res)
> + return res;
> +
> + /*
> + * Mark this dir opaque to show that we have already copied up
> + * the lower entries. Only fallthru entries pass through to
> + * the underlying file system.
> + *

> + * XXX Deal with the lower file system changing. This could
> + * be through running a tool over the top level file system to
> + * make directories transparent again, or we could check the
> + * mtime of the underlying directory.

Yikes, why the mention of this cache coherency issue here? If it's so
important, then why not mention it everywhere and in the design doc? I
personally think trying to solve the cache-coherency in layers is too much
work all at once: focus on basic UM functionality first. So I'd remove this
comment from here, and add some discussion of cache coherency issues under a
"Limitations" section of the design doc.

> + */
> +
> + topmost_dentry->d_inode->i_flags |= S_OPAQUE;
> + mark_inode_dirty(topmost_dentry->d_inode);
> +
> + /*
> + * Loop through each dir on each level copying up the entries
> + * to the topmost.
> + */
> +
> + /* Don't drop the caller's reference to the topmost path */
> + path_get(&path);
> + while (follow_union_down(&path.mnt, &path.dentry)) {
> + struct file * ftmp;
> + struct inode * inode;
> +
> + /* XXX Permit fallthrus on lower-level? Would need to
> + * pass in opaque flag to union_copyup_dir_one() and
> + * only copy up fallthru entries there. We allow
> + * fallthrus in lower level opaque directories on
> + * lookup, so for consistency we should do one or the
> + * other in both places. */
> + if (IS_OPAQUE(path.dentry->d_inode))
> + break;
> +
> + /* dentry_open() doesn't get a path reference itself */
> + path_get(&path);
> + ftmp = dentry_open(path.dentry, path.mnt,
> + O_RDONLY | O_DIRECTORY | O_NOATIME,
> + current_cred());
> + if (IS_ERR(ftmp)) {
> + printk (KERN_ERR "unable to open dir %s for "
> + "directory copyup: %ld\n",
> + path.dentry->d_name.name, PTR_ERR(ftmp));
> + continue;
> + }
> +
> + inode = path.dentry->d_inode;
> + mutex_lock(&inode->i_mutex);
> +
> + res = -ENOENT;
> + if (IS_DEADDIR(inode))
> + goto out_fput;
> + /*
> + * Read the whole directory, calling our directory
> + * entry copyup function on each entry. Pass in the
> + * topmost dentry as our private data so we can create
> + * new entries in the topmost directory.
> + */
> + res = ftmp->f_op->readdir(ftmp, topmost_dentry,
> + union_copyup_dir_one);
> +out_fput:

You can eliminate this out_fput label here by rewriting the code:

	if (!IS_DEADDIR(inode))
		res = ftmp->f_op->readdir(ftmp, topmost_dentry,
					  union_copyup_dir_one);

> + mutex_unlock(&inode->i_mutex);
> + fput(ftmp);
> +
> + if (res)
> + break;
> + }
> + path_put(&path);
> + mnt_drop_write(topmost_path->mnt);
> + return res;
> +}
> diff --git a/include/linux/union.h b/include/linux/union.h
> index 405baa9..a0656b3 100644
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -57,6 +57,7 @@ extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
> struct path *);
> extern int __union_copyup(struct path *, struct nameidata *, struct path *);
> extern int union_copyup(struct nameidata *, int);
> +extern int union_copyup_dir(struct path *path);
>
> #else /* CONFIG_UNION_MOUNT */
>
> @@ -74,6 +75,7 @@ extern int union_copyup(struct nameidata *, int);
> #define union_create_topmost(x, y, z) ({ BUG(); (NULL); })
> #define __union_copyup(x, y, z) ({ BUG(); (0); })
> #define union_copyup(x, y) ({ (0); })
> +#define union_copyup_dir(x) ({ BUG(); (0); })
>
> #endif /* CONFIG_UNION_MOUNT */
> #endif /* __KERNEL__ */
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Erez.

2009-12-01 04:27:03

by Erez Zadok

Subject: Re: [PATCH 38/41] union-mount: Make pivot_root work with union mounts

In message <[email protected]>, Valerie Aurora writes:
> When moving a union mount, follow it down to the bottom layer and move
> that instead of just the top layer.
>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namespace.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 9b71743..6ac5fc1 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2282,6 +2282,15 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
> if (d_unlinked(old.dentry))
> goto out2;
> error = -EBUSY;
> + /*
> + * follow_union_down() only goes one layer down. We want the
> + * bottom-most layer here - if we move that around, all the
> + * layers on top move with it. But if we ever allow more than
> + * two layers, the below two will both need to be in while()
> + * loops.
> + */

Given the nature of this comment, I'd stick an "XXX" there for easier
ability to grep-the-src-for-issues.
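
For what the "while() loops" in that comment would amount to, here is a small userspace model. struct layer, follow_down_once(), and follow_down_to_bottom() are hypothetical names for this sketch; the only thing borrowed from the patches is the convention that follow_union_down() returns nonzero when it actually moved down a layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one layer in a union stack. */
struct layer {
	const char *name;
	struct layer *below;	/* NULL on the bottom-most layer */
};

/* Models follow_union_down(): step one layer down, true if we moved. */
static bool follow_down_once(struct layer **l)
{
	assert(l != NULL && *l != NULL);

	if ((*l)->below == NULL)
		return false;
	*l = (*l)->below;
	return true;
}

/* Keep stepping down until the bottom-most layer is reached. */
static struct layer *follow_down_to_bottom(struct layer *top)
{
	struct layer *cur = top;

	while (follow_down_once(&cur))
		;	/* keep descending */
	return cur;
}
```

With only two layers, a single call and the loop give the same result, which is why the quoted code gets away without the loop today.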

> + follow_union_down(&new.mnt, &new.dentry);
> + follow_union_down(&root.mnt, &root.dentry);
> if (new.mnt == root.mnt ||
> old.mnt == root.mnt)
> goto out2; /* loop, on the same file system */
> --
> 1.6.3.3
>

Erez.

2009-12-01 04:35:28

by Erez Zadok

Subject: Re: [PATCH 39/41] union-mount: Ignore read-only file system in permission checks

In message <[email protected]>, Valerie Aurora writes:
> In certain cases, we check a file for write access before it has been
> copied up to the top-level fs. We don't want to fail because the
> bottom layer is read-only - of course it is - so skip that check in
> those cases.
>
> Thanks to Felix Fietkau <[email protected]> for a bug fix.
>
> XXX - Document when to call union_permission() vs. inode_permission()
> XXX - Kinda gross. Probably a simpler solution.
>
> Signed-off-by: Valerie Aurora <[email protected]>
> ---
> fs/namei.c | 21 +++++++++++++++++----
> fs/open.c | 8 ++++++--
> fs/union.c | 32 ++++++++++++++++++++++++++++++--
> include/linux/fs.h | 1 +
> include/linux/union.h | 2 ++
> 5 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 61e94aa..a8d3acf 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -230,16 +230,17 @@ int generic_permission(struct inode *inode, int mask,
> }
>
> /**
> - * inode_permission - check for access rights to a given inode
> + * __inode_permission - check for access rights to a given inode
> * @inode: inode to check permission on
> * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + * @rofs: check for read-only fs
> *
> * Used to check for read/write/execute permissions on an inode.
> * We use "fsuid" for this, letting us set arbitrary permissions
> * for filesystem access without changing the "normal" uids which
> * are used for other things.
> */
> -int inode_permission(struct inode *inode, int mask)
> +int __inode_permission(struct inode *inode, int mask, int rofs)
> {
> int retval;

rofs can be a boolean.

While I normally prefer to avoid magic flags passed to a function to change
its behavior, in this case it's a small and obvious change. I could use
your __inode_permission as is in Unionfs today, if it was upstream; in
Unionfs I had to copy inode_permission to my code, and remove the EROFS
test.
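
A userspace sketch of why a boolean plus a logical && reads more safely than the bitwise & in the hunk below. Everything here is hypothetical: sketch_inode, SKETCH_RDONLY, and both check functions are invented for illustration. As far as I know MS_RDONLY is 1, so the quoted "(rofs & IS_RDONLY(inode))" works as written when rofs is 1; the sketch deliberately uses a flag value of 4 to show how the bitwise form depends on that coincidence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value; 4 rather than 1 to expose the '&' pitfall. */
#define SKETCH_RDONLY 0x4u

struct sketch_inode {
	unsigned int sb_flags;	/* models inode->i_sb->s_flags */
};

/* Models IS_RDONLY(): yields the masked bit, not necessarily 0 or 1. */
static unsigned int sketch_is_rdonly(const struct sketch_inode *inode)
{
	return inode->sb_flags & SKETCH_RDONLY;
}

/* Boolean flag combined with logical AND: independent of the bit value. */
static int sketch_write_check(const struct sketch_inode *inode,
			      bool check_rofs)
{
	assert(inode != NULL);

	if (check_rofs && sketch_is_rdonly(inode))
		return -EROFS;
	return 0;
}

/* Bitwise variant, mirroring the shape of "(rofs & IS_RDONLY(inode))". */
static int sketch_write_check_bitwise(const struct sketch_inode *inode,
				      int rofs)
{
	assert(inode != NULL);

	if (rofs & sketch_is_rdonly(inode))
		return -EROFS;	/* only reached if the bits happen to overlap */
	return 0;
}
```

In this model the bitwise variant silently lets a write through on a read-only file system, because 1 & 4 is 0.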

>
> @@ -249,7 +250,7 @@ int inode_permission(struct inode *inode, int mask)
> /*
> * Nobody gets write access to a read-only fs.
> */
> - if (IS_RDONLY(inode) &&
> + if ((rofs & IS_RDONLY(inode)) &&
> (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
> return -EROFS;
>
> @@ -277,6 +278,18 @@ int inode_permission(struct inode *inode, int mask)
> }
>
> /**
> + * inode_permission - check for access rights to a given inode
> + * @inode: inode to check permission on
> + * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * This version pays attention to the MS_RDONLY flag on the fs.
> + */
> +int inode_permission(struct inode *inode, int mask)
> +{
> + return __inode_permission(inode, mask, 1);
> +}
> +
> +/**
> * file_permission - check for additional access rights to a given file
> * @file: file to check access rights for
> * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> @@ -2129,7 +2142,7 @@ int may_open(struct path *path, int acc_mode, int flag)
> break;
> }
>
> - error = inode_permission(inode, acc_mode);
> + error = union_permission(path, acc_mode);
> if (error)
> return error;
>
> diff --git a/fs/open.c b/fs/open.c
> index dd98e80..3df5a1b 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -30,6 +30,7 @@
> #include <linux/audit.h>
> #include <linux/falloc.h>
> #include <linux/fs_struct.h>
> +#include <linux/union.h>
>
> int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> @@ -333,6 +334,7 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
> error = security_path_truncate(&file->f_path, length,
> ATTR_MTIME|ATTR_CTIME);
> if (!error)
> + /* Already copied up for union, opened with write */
> error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
> out_putf:
> fput(file);
> @@ -493,7 +495,8 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
> goto out_path_release;
> }
>
> - res = inode_permission(inode, mode | MAY_ACCESS);
> + res = union_permission(&path, mode | MAY_ACCESS);
> +
> /* SuS v2 requires we report a read only fs too */
> if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
> goto out_path_release;
> @@ -507,7 +510,8 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
> * inherently racy and know that the fs may change
> * state before we even see this result.
> */
> - if (__mnt_is_readonly(path.mnt))
> + if ((!is_unionized(path.dentry, path.mnt) &&
> + (__mnt_is_readonly(path.mnt))))
> res = -EROFS;
>
> out_path_release:
> diff --git a/fs/union.c b/fs/union.c
> index d56b829..8d94b22 100644
> --- a/fs/union.c
> +++ b/fs/union.c
> @@ -390,6 +390,30 @@ static int union_relookup_topmost(struct nameidata *nd, int flags)
> return err;
> }
>
> +
> +/**
> + * union_permission - check for access rights to a given inode
> + * @inode: inode to check permission on
> + * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * In a union mount, the top layer is always read-write and the bottom
> + * is always read-only. Ignore the read-only flag on the lower fs.
> + *
> + * Only need for certain activities, like checking to see if write
> + * access is ok.
> + */
> +
> +int union_permission(struct path *path, int mask)
> +{
> + struct inode *inode = path->dentry->d_inode;
> +
> + if (!is_unionized(path->dentry, path->mnt))
> + return inode_permission(inode, mask);
> +
> + /* Tell __inode_permission to ignore MS_RDONLY */
> + return __inode_permission(inode, mask, 0);
> +}
> +
> /*
> * union_create_topmost - create the topmost path component
> * @nd: pointer to nameidata of the base directory
> @@ -489,6 +513,9 @@ static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
> if (IS_ERR(new_file))
> goto fput_old;
>
> + /* XXX be smart by using a length param, which indicates max
> + * data we'll want (e.g., we are about to truncate to 0 or 10
> + * bytes or something */

Useful comment, but not here: I'd put it right in the very first
copyup-related patch. And add it as "todo" to the design doc.

> size = i_size_read(old_file->f_path.dentry->d_inode);
> if (((size_t)size != size) || ((ssize_t)size != size)) {
> ret = -EFBIG;
> @@ -516,7 +543,8 @@ static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
> * The topmost directory @new_nd must already be locked. Creates the topmost
> * file if it doesn't exist yet.
> */
> -int __union_copyup(struct path *old, struct nameidata *new_nd, struct path *new)
> +int __union_copyup(struct path *old, struct nameidata *new_nd,
> + struct path *new)
> {
> struct dentry *dentry;
> int error;
> @@ -581,7 +609,7 @@ out_dput:
> * @nd: nameidata pointer to the file
> * @flags: flags given to open_namei
> */
> -int union_copyup(struct nameidata *nd, int flags)
> +int union_copyup(struct nameidata *nd, int flags /* XXX not used */)

If not used, then why not remove it?

> {
> struct qstr this;
> char *name;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 57690ab..38fb113 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2106,6 +2106,7 @@ extern void emergency_remount(void);
> extern sector_t bmap(struct inode *, sector_t);
> #endif
> extern int notify_change(struct dentry *, struct iattr *);
> +extern int __inode_permission(struct inode *inode, int mask, int rofs);
> extern int inode_permission(struct inode *, int);
> extern int generic_permission(struct inode *, int,
> int (*check_acl)(struct inode *, int));
> diff --git a/include/linux/union.h b/include/linux/union.h
> index a0656b3..92654e0 100644
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -58,6 +58,7 @@ extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
> extern int __union_copyup(struct path *, struct nameidata *, struct path *);
> extern int union_copyup(struct nameidata *, int);
> extern int union_copyup_dir(struct path *path);
> +extern int union_permission(struct path *, int);
>
> #else /* CONFIG_UNION_MOUNT */
>
> @@ -76,6 +77,7 @@ extern int union_copyup_dir(struct path *path);
> #define __union_copyup(x, y, z) ({ BUG(); (0); })
> #define union_copyup(x, y) ({ (0); })
> #define union_copyup_dir(x) ({ BUG(); (0); })
> +#define union_permission(x, y) inode_permission((x)->dentry->d_inode, y)
>
> #endif /* CONFIG_UNION_MOUNT */
> #endif /* __KERNEL__ */
> --
> 1.6.3.3
>

Erez.

2009-12-01 04:50:45

by Erez Zadok

Subject: Re: [PATCH 40/41] union-mount: Make truncate work in all its glorious UNIX variations

In message <[email protected]>, Valerie Aurora writes:
> Implement truncate(), ftruncate(), and open(O_TRUNC) for union mounts.
>
> This moves the union_copyup() in do_filp_open() down below may_open()
> - this way you don't copy up a file you don't even have permission to
> open.
>
> may_open() now takes a nameidata * because it may have to do a
> union_copyup() internally if O_TRUNC is specified. It's a trivial
> change, all callers were just doing "may_open(&nd.path, ...)" anyway.
> It kinda sucks, but may_open() auto-magically doing a truncate also
> sucks (may open? may truncate, too!).

Hmmm, perhaps may_open needs to be renamed then? (may_rename_and_truncate?)

> XXX - Only copy up the bytes that won't be truncated.
> XXX - Re-organize code. may_open() especially blah.
> XXX - truncate() implemented as in-kernel file open and ftruncate()
> XXX - Split up into smaller pieces
[...]

Erez.

2009-12-01 04:57:56

by Erez Zadok

Subject: Re: [PATCH 41/41] union-mount: Add support for rename by __union_copyup()

In message <[email protected]>, Valerie Aurora writes:
> From: Jan Blunck <[email protected]>
>
> It is possible to use __union_copyup() to support rename of regular files
> without returning -EXDEV.
>
> XXX - Rewrite as copyup to old name followed by rename() + whiteout()

All this code just to support rename by copyup?! I can see why we're
looking for other tricks, such as symlinks...

> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>

> +// error = hash_lookup_union(&oldnd, &oldnd.last, &old);
> + error = lookup_rename_source(&oldnd, &newnd, &trap, &oldnd.last, &old);

> +// if (is_unionized(newnd.path.dentry, newnd.path.mnt))
> +// goto exit5;

Nuke.

Erez.

2009-12-01 05:37:39

by Erez Zadok

Subject: Re: [PATCH 18/41] union-mount: Documentation

Val, I first read the documentation, but didn't comment on it until I had
read the rest of the patches. I won't repeat in detail what I've said in
the other patches regarding the documentation: listing short-term and
long-term tasks in order, adding a "limitations" section, etc. I'll try
here to focus only on new issues (other than "please spell check the
doc" :-)

In message <[email protected]>, Valerie Aurora writes:

> +Terminology
> +===========
> +
> +The main analogy for writable overlays is that a writable file system
> +is mounted "on top" of a read-only file system. Lookups start at the
> +"top" read-write file system and travel "down" to the "bottom"
> +read-only file system only if no blocking entry exists on the top
> +layer.
> +
> +Top layer: The read-write file system. Lookups begin here.
> +
> +Bottom layer: The read-only file system. Lookups end here.

Recall my gripes about terminology: top/bottom, upper/lower, this/next, etc.
The docs and srcs should use consistent terminology.

> +Path: Combination of the vfsmount and dentry structure.
> +
> +Follow down: Given a path from the top layer, find the corresponding
> +path on the bottom layer.
> +
> +Follow up: Given a path from the bottom layer, find the corresponding
> +path on the top layer.
> +
> +Whiteout: A directory entry in the top layer that prevents lookups
> +from travelling down to the bottom layer. Created on unlink()/rmdir()
> +if a corresponding directory entry exists in the bottom layer.
> +
> +Opaque: A flag on a directory in the top layer that prevents lookups
> +of entries in this directory from travelling down to the bottom
> +layer (unless there is an explicit fallthru entry allowing that for a
> +particular entry). Set on creation of a directory that replaces a
> +whiteout, and after a directory copyup.
> +
> +Fallthru: A directory entry which allows lookups to "fall through" to
> +the bottom layer for that exact directory entry. This serves as a
> +placeholder for directory entries from the bottom layer during
> +readdir(). Fallthrus override opaque flags.

The problem I have with this Terminology section is that it does more than
just define terms: it also describes their use. Because of that, you have a
chicken-and-egg text here; the description of Opaque refers to Fallthru and
vice versa, so there's no clean order to those two terms. I have to read
this section twice before I can understand it. Good text doesn't require
multiple passes (like a good compiler :-)

A better way would be to make this section JUST describe terms. And then
follow it with a section which describes HOW those terms are used and
interact with each other. That'll break the chicken-and-egg cycles.
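
As a rough illustration of what such a "how the terms interact" section could pin down, here is a userspace sketch of the lookup decision as the quoted definitions describe it. The enum and function names are hypothetical and this is only my reading of the Terminology text, not code from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* What the top layer holds for a given name (hypothetical model). */
enum top_entry {
	TOP_NONE,	/* no entry for this name in the top layer */
	TOP_POSITIVE,	/* a real file or directory in the top layer */
	TOP_WHITEOUT,	/* blocks lookup of this name */
	TOP_FALLTHRU,	/* explicitly lets lookup continue downward */
};

/*
 * Returns true when a lookup of one name should continue to the
 * bottom layer, given the top-layer entry for that name and the
 * opaque flag of the containing top-layer directory.
 */
static bool lookup_falls_through(enum top_entry entry, bool dir_is_opaque)
{
	switch (entry) {
	case TOP_POSITIVE:
	case TOP_WHITEOUT:
		return false;		/* top layer masks the bottom */
	case TOP_FALLTHRU:
		return true;		/* fallthrus override opaque */
	case TOP_NONE:
		return !dir_is_opaque;	/* opaque dirs stop the lookup */
	}
	assert(!"unknown top_entry value");
	return false;
}
```

Written this way, Opaque and Fallthru can each be defined in one line, with the ordering between them stated once in the decision table rather than inside both definitions.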

> +
> +File copyup: Create a file on the top layer that has the same properties
> +and contents as the file with the same pathname on the bottom layer.
> +
> +Directory copyup: Copy up the visible directory entries from the
> +bottom layer as fallthrus in the matching top layer directory. Mark
> +the directory opaque to avoid unnecessary negative lookups on the
> +bottom layer.
> +
> +Examples
> +========
> +
> +What happens when I...
> +
> +- creat() /newfile -> creates on top layer
> +- unlink() /oldfile -> creates a whiteout on top layer
> +- Edit /existingfile -> copies up to top layer at open(O_WR) time
> +- truncate /existingfile -> copies up to top layer + N bytes if specified
> +- touch()/chmod()/chown()/etc. -> copies up to top layer
> +- mkdir() /newdir -> creates on top layer
> +- rmdir() /olddir -> creates a whiteout on top layer
> +- mkdir() /olddir after above -> creates on top layer w/ opaque flag
> +- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
> +- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
> +- symlink() /oldfile /symlink -> nothing special
> +- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
> +- rename() dir -> EXDEV

These examples are premature. You haven't yet described HOW the various ops
in UM work in sufficient detail. I'd move examples much further down.

Also, these examples come out of nowhere. You haven't described the
environment for me in sufficient detail. What is /newfile and /oldfile and
/existingfile: in which layer(s) do they live?

Also, now that I've gone over the rest of the patches, there are
discrepancies b/t these examples and what your code does (e.g., patch 41
changed how rename behaves). Finally, I don't think all ops have been
defined here:

- ls/readdir/stat?
- open(O_WR)?

> +Getting to a root file system with a writable overlay:
> +
> +- Mount the base read-only file system as the root file system
> +- Mount the read-only file system again on /newroot
> +- Mount the writable overlay on /newroot:
> + # mount -o union /dev/sda /newroot
> +- pivot_root to /newroot
> +- Start init
> +
> +See scripts/pivot.sh in the UML devkit linked to from:
> +
> +http://valerieaurora.org/union/
> +
> +VFS implementation
> +==================
> +
> +Writable overlays are implemented as an integral part of the VFS,
> +rather than as a VFS client file system (i.e., a stacked file system
> +like unionfs or ecryptfs). Implementing writable overlays inside the
> +VFS eliminates the need for duplicate copies of VFS data structures,
> +unnecessary indirection, and code duplication, but requires very
> +maintainable, low-to-zero overhead code. Writable overlays require no
> +change to file systems serving as the read-only layer, and requires
> +some minor support from file systems serving as the read-write layer.
> +File systems that want to be the writable layer must implement the new
> +->whiteout() and ->fallthru() inode operations, which create special
> +dummy directory entries.
> +
> +union_mount structure
> +---------------------
> +
> +The primary data structure for writable overlays is the union_mount
> +structure, which connects overlapping directory dentries into a "union
> +stack":
> +
> +struct union_mount {
> + atomic_t u_count; /* reference count */
> + struct mutex u_mutex;
> + struct list_head u_unions; /* list head for d_unions */
> + struct list_head u_list; /* list head for mnt_unions */
> + struct hlist_node u_hash; /* list head for searching */
> + struct hlist_node u_rhash; /* list head for reverse searching */
> +
> + struct path u_this; /* this is me */
> + struct path u_next; /* this is what I overlay */
> +};
> +
> +The union_mount is referenced from the corresponding directory's
> +dentry:
> +
> +struct dentry {
> +[...]
> +#ifdef CONFIG_UNION_MOUNT
> + /*
> + * The following fields are used by the VFS based union mount
> + * implementation. Both are protected by union_lock!
> + */
> + struct list_head d_unions; /* list of union_mounts */
> + unsigned int d_unionized; /* unions referencing this dentry */
> +#endif
> +[...]
> +};
> +
> +Each top layer directory with the potential for a lookup to fall
> +through to the bottom layer has a union_mount structure stored in a
> +union_mount hash table. The union_mount's can be looked up both by the
> +top layer's path (via union_lookup()) and the bottom layer's path (via
> +union_rlookup()). Once you have the path (vfsmount and dentry pair)
> +of a file, the union stack can be followed down, layer by layer, with
> +follow_union_down(), and up with follow_union_mount().
> +
> +All union_mount's are allocated from a kmem cache when the
> +corresponding dentries are created. union_mount's are allocated when
> +the first referencing dentry is allocated and freed when all of the
> +referencing dentries are freed - that is, the dcache drives the union
> +cache. While writable overlays only use two layers, the union stack
> +infrastructure is capable of supporting an arbitrary number of file
> +system layers (leaving aside locking issues).
> +
> +Todo:
> +
> +- Rename union_mount structure - it's per directory, not per mount

You have tons of 'todo' items sprinkled throughout this doc and the sources.
It's ok for them to be in the sources, but please have one giant 'todo'
section here with every issue/item that needs to be addressed, so it's
easier for anyone to find out the current status of this project.

> +Userland support
> +================

About userland support: a while back I created a lightweight version of
unionfs which used native whiteouts, using an older version of Jan's patches
which supported whiteouts and opaque dirs natively in lower file systems. I
then found it necessary to expose whiteouts and opaques to userland,
for testing/debugging purposes. I think you'll need to do the same: support
query/add/remove methods for whiteouts, opaques, and fallthrus. I did it
with top-level ioctls in fs/ioctl.c: it seemed like the most reasonable
option short of creating new syscalls. I can dig up my patches if you're
interested.

Exposing these to userland is important: you have to be able to write
regression suites which test that the kernel code properly creates
whiteouts, opaques, etc. You have to be able to hand-create small file
systems pre-populated with whiteouts et al, and test how the kernel handles
them (e.g., creat(2) of a file that was a whiteout before, or mkdir() of an
opaque'd dir). All this would be useful for the long term LTP-like
test-ability of UM.

----

Finally, Val, thanks for taking over this project, for the code and
documentation, and the ongoing efforts. Good luck.

Sincerely,
Erez.

2009-12-10 20:21:01

by Valerie Aurora

Subject: Re: [PATCH 01/41] VFS: BUG() if somebody tries to rehash an already hashed dentry

On Sun, Nov 29, 2009 at 08:43:58PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > Break early when somebody tries to rehash an already hashed dentry.
> > Otherwise this leads to interesting corruptions in the dcache hash table
> > later on.
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/dcache.c | 1 +
> > 1 files changed, 1 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 9e5cd3c..38bf982 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1550,6 +1550,7 @@ void d_rehash(struct dentry * entry)
> > {
> > spin_lock(&dcache_lock);
> > spin_lock(&entry->d_lock);
> > + BUG_ON(!d_unhashed(entry));
> > _d_rehash(entry);
> > spin_unlock(&entry->d_lock);
> > spin_unlock(&dcache_lock);
>
> This patch seems unrelated to union mounts. If so, can you get it pushed
> upstream sooner? Or is this a debugging patch useful only when developing
> union mounts?
>
> You also said that it can lead to "interesting corruptions". What kind of
> corruptions exactly? Also, would it make more sense to allow _d_rehash() to
> hash in an unhashed dentry for the first time?

Hi Erez,

Thanks for your great review! I am working my way through your
comments one by one.

This is a trivial patch which happened to be useful during our
development and seems like it might be useful for other VFS-related
development. I will submit it as part of our VFS patch set and drop
it if the maintainers don't want it.

I don't have an opinion on _d_rehash(), I'm afraid.

-VAL

2009-12-10 21:24:05

by Valerie Aurora

Subject: Re: [PATCH 03/41] VFS: Make lookup_hash() return a struct path

On Sun, Nov 29, 2009 at 09:02:31PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > This patch changes lookup_hash() into returning a struct path.
>
> Actually, lookup_hash now also takes a qstr.
>
> This is a somewhat involved patch. I think more documentation is needed to
> list all the places it touches and changes, b/c now struct path has to
> propagate in various other places. (In general, passing struct path instead
> of struct dentry is going in the right direction: eventually we could get rid
> of lookup_one_len.)

Hm, it seems like a straightforward next step in the long-term project
of migration from dentries to paths. I looked at some of the previous
dentry->path patches and they didn't include this kind of documentation.

> > @@ -1219,14 +1219,22 @@ out:
> > * needs parent already locked. Doesn't follow mounts.
> > * SMP-safe.
> > */
> > -static struct dentry *lookup_hash(struct nameidata *nd)
> > +static int lookup_hash(struct nameidata *nd, struct qstr *name,
> > + struct path *path)
> > {
>
> I suggest you document above this function what the @name and @path are for,
> who is supposed to allocate and free them, caller/callee's responsibilities,
> side effects (if any), new return status upon success/failure, etc.

That would be good, but consistently documenting existing VFS
functionality would be a large project and not one I'm going to take
on. :)

-VAL

2009-12-10 21:24:56

by Valerie Aurora

Subject: Re: [PATCH 03/41] VFS: Make lookup_hash() return a struct path

On Mon, Nov 30, 2009 at 01:04:13AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
>
> > @@ -1937,7 +1942,8 @@ EXPORT_SYMBOL(filp_open);
> > */
> > struct dentry *lookup_create(struct nameidata *nd, int is_dir)
> > {
> > - struct dentry *dentry = ERR_PTR(-EEXIST);
> > + struct path path = { .dentry = ERR_PTR(-EEXIST) } ;
>
> I assume the compiler will initialize path.mnt to NULL. Is NULL what you
> want? Even if the compiler guarantees it, I think you should either
> explicitly init .mnt to NULL or leave a comment explaining what's going on
> -- so no future code reader will think that this was omitted; a comment can
> clarify your intentions more explicitly.

That is an unpleasant thing to look at. I rewrote the exit paths so
that this initialization was unnecessary.
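
On the question of whether .mnt ends up NULL: as far as I know, C99 guarantees that members not named in a designated initializer are initialized as if they had static storage duration, so the pointer would be NULL. A small userspace sketch, with struct sketch_path and make_path_with_dentry() as hypothetical stand-ins for the kernel types:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct path (hypothetical). */
struct sketch_path {
	void *mnt;
	void *dentry;
};

/* Build a path naming only .dentry in the initializer. */
static struct sketch_path make_path_with_dentry(void *dentry)
{
	assert(dentry != NULL);

	/* .mnt is not named, so it is implicitly initialized to NULL. */
	struct sketch_path path = { .dentry = dentry };

	return path;
}
```

So the original initializer was well defined; an explicit ".mnt = NULL" or a comment would only have served the reader, which is the same goal the exit-path rewrite achieves.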

Thanks,

-VAL

2009-12-10 21:30:11

by Valerie Aurora

Subject: Re: [PATCH 04/41] VFS: Remove unnecessary micro-optimization in cached_lookup()

On Sun, Nov 29, 2009 at 09:07:39PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > d_lookup() takes rename_lock which is a seq_lock. This is so cheap
> > it's not worth calling lockless __d_lookup() first from
> > cache_lookup(). Rename cached_lookup() to cache_lookup() while we're
> > there.
>
> Val, this is another patch unrelated to union mounts, an
> optimization/simplification of the VFS code. I think you need to try and
> push such VFS patches upstream more quickly, so as to reduce the set of UM
> patches you have to maintain.

I agree. We posted them separately once and will do so again.

-VAL

2010-01-20 23:36:44

by Valerie Aurora

Subject: Re: [PATCH 06/41] VFS: Introduce dput() variant that maintains a kill-list

On Sun, Nov 29, 2009 at 09:28:40PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > This patch introduces a new variant of dput(). This becomes necessary to
> > prevent a recursive call to dput() from the union mount code.
> >
> > void __dput(struct dentry *dentry, struct list_head *list, int greedy);
> > struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
> > int greedy);
> >
> > __dput() works mostly like the original dput() did. The main difference is
> > that if the greedy argument is zero it will put the parent on a special
> > list instead of trying to get rid of it directly.
> >
> > Therefore the union mount code can safely call __dput() when it wants to get
> > rid of underlying dentry references during a dput(). After calling __dput()
> > or __d_kill() the caller must make sure that __d_kill_final() is called on all
> > dentries on the kill list. __d_kill_final() is actually doing the
> > dentry_iput() and is also dereferencing the parent.
>
> From the description above, there is something somewhat unclean about all
> the special things that now have to happen: a special flag to affect how a
> function behaves, an extra requirement on the caller of __d_kill, etc. I
> wonder if there is a clear way to achieve this.

I looked into this some more, and it looks like this patch might be
unnecessary with the current code base.

We were worried about a recursive dput() call through:

dput()->d_kill()->shrink_d_unions()->union_put()->dput()

But this path can only be reached if the dentry is unhashed when we
enter the first dput(), and it can only be unhashed if it was
rmdir()'d, and that means we called d_delete(), and d_delete() calls
shrink_d_unions() for us. So if we do call d_kill() from dput(), the
unions are already gone and there is no danger of calling dput()
again.
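
For context, the kill-list idea from the quoted patch description can be modelled in userspace as follows. This is only a sketch of the general technique (queue a dying object's parent on a list and drain the list in a loop, so that releasing never recurses). The struct and function names are hypothetical and it is not the code from the patch:

```c
#include <stddef.h>

struct node {
	struct node *parent;
	int refcount;
	struct node *next_kill;	/* link on the kill list */
	int freed;		/* set instead of freeing, so callers can inspect */
};

/* Drop one reference; if it hits zero, queue the node instead of recursing. */
static void put_deferred(struct node *n, struct node **kill_list)
{
	if (n && --n->refcount == 0) {
		n->next_kill = *kill_list;
		*kill_list = n;
	}
}

/* Drop a reference and drain the kill list in a loop. Returns nodes released. */
int put_node(struct node *n)
{
	struct node *kill_list = NULL;
	int released = 0;

	put_deferred(n, &kill_list);
	while (kill_list) {
		struct node *victim = kill_list;

		kill_list = victim->next_kill;
		victim->freed = 1;
		released++;
		/* Releasing the victim drops its reference on the parent. */
		put_deferred(victim->parent, &kill_list);
	}
	return released;
}
```

The stack depth stays constant no matter how long the chain of parents is, which is the property the kill list is meant to provide.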

Jan, does this make sense? If not, do you have a test case that
triggers a recursive dput()?

Thanks,

-VAL

> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/dcache.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 files changed, 105 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 38bf982..3415e9e 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -157,14 +157,19 @@ static void dentry_lru_del_init(struct dentry *dentry)
> > }
> >
> > /**
> > - * d_kill - kill dentry and return parent
> > + * __d_kill - kill dentry and return parent
> > * @dentry: dentry to kill
> > + * @list: kill list
> > + * @greedy: return parent instead of putting it on the kill list
> > *
> > * The dentry must already be unhashed and removed from the LRU.
> > *
> > - * If this is the root of the dentry tree, return NULL.
> > + * If this is the root of the dentry tree, return NULL. If greedy is zero, we
> > + * put the parent of this dentry on the kill list instead. The callers must
> > + * make sure that __d_kill_final() is called on all dentries on the kill list.
> > */
> > -static struct dentry *d_kill(struct dentry *dentry)
> > +static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list,
> > + int greedy)
>
> If you're keeping 'greedy' then perhaps make it a bool instead of 'int';
> that way you don't have to pass an unclear '0' or '1' in the rest of the
> code.
>
> > +void __dput(struct dentry *, struct list_head *, int);
>
> Can you move the __dput() code here and avoid the forward function
> declaration?
>
> Can __dput() be made static, or do you need to call it from elsewhere? I
> didn't see an extern for it in this patch. If there's an extern in another
> patch, then it should be moved here.
>
> > +static void __d_kill_final(struct dentry *dentry, struct list_head *list)
> > +{
>
> Your patch header says that the caller of __dput or __d_kill must call
> __d_kill_final. So shouldn't this be a non-static extern'ed function?
>
> Either way, I suggest documenting in a comment above __d_kill_final() who
> should call it and under what circumstances.
>
>
> > + iput(inode);
> > + }
> > +
> > + if (IS_ROOT(dentry))
> > + parent = NULL;
> > + else
> > + parent = dentry->d_parent;
> > + d_free(dentry);
> > + __dput(parent, list, 1);
> > +}
> > +
> > +/**
> > + * d_kill - kill dentry and return parent
> > + * @dentry: dentry to kill
> > + *
> > + * The dentry must already be unhashed and removed from the LRU.
> > + *
> > + * If this is the root of the dentry tree, return NULL.
> > + */
> > +static struct dentry *d_kill(struct dentry *dentry)
> > +{
> > + LIST_HEAD(mortuary);
> > + struct dentry *parent;
> > +
> > + parent = __d_kill(dentry, &mortuary, 1);
> > + while (!list_empty(&mortuary)) {
> > + dentry = list_entry(mortuary.next, struct dentry, d_lru);
> > + list_del(&dentry->d_lru);
> > + __d_kill_final(dentry, &mortuary);
> > + }
> > +
> > + return parent;
> > +}
> > +
> > /*
> > * This is dput
> > *
> > @@ -199,19 +266,24 @@ static struct dentry *d_kill(struct dentry *dentry)
> > * Real recursion would eat up our stack space.
> > */
> >
> > -/*
> > - * dput - release a dentry
> > - * @dentry: dentry to release
> > +/**
> > + * __dput - release a dentry
> > + * @dentry: dentry to release
> > + * @list: kill list argument for __d_kill()
> > + * @greedy: greedy argument for __d_kill()
> > *
> > * Release a dentry. This will drop the usage count and if appropriate
> > * call the dentry unlink method as well as removing it from the queues and
> > * releasing its resources. If the parent dentries were scheduled for release
> > - * they too may now get deleted.
> > + * they too may now get deleted if @greedy is not zero. Otherwise parent is
> > + * added to the kill list. The callers must make sure that __d_kill_final() is
> > + * called on all dentries on the kill list.
> > + *
> > + * You probably want to use dput() instead.
> > *
> > * no dcache lock, please.
> > */
> > -
> > -void dput(struct dentry *dentry)
> > +void __dput(struct dentry *dentry, struct list_head *list, int greedy)
> > {
>
> I wonder now if the "__" prefix in __dput is appropriate: usually it's
> reserved for "hidden" internal functions that are not supposed to be called
> by other users, right? I try to avoid naming things FOO and __FOO because
> the name alone doesn't help me understand what each one might be doing. So
> maybe rename __dput() to something more descriptive?
>
> Erez.

2010-01-21 00:20:35

by Valerie Aurora

Subject: Re: [PATCH 09/41] whiteout: Don't return information about whiteouts to userspace

On Sun, Nov 29, 2009 at 09:53:30PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > The userspace isn't ready for handling another filetype. Therefore this
> > patch lets readdir() and others skip over the whiteout directory entries
> > they might find.
>
> The NFSD maintainers and MLs should be CC'ed on such patches which touch
> fs/nfsd/. I'd also suggest you change the subject line of this patch to:
>
> whiteout/NFSD: don't return ...

Thanks, I made these and your other suggested changes below.

-VAL

> This patch seems fairly straightforward: it returns 0 when d_type is DT_WHT.
> As long as there's no way to create such whiteout entries (not until UM is
> used), then there's no harm in pushing such patches upstream, no?
>
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: David Woodhouse <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/compat.c | 9 +++++++++
> > fs/nfsd/nfs3xdr.c | 5 +++++
> > fs/nfsd/nfs4xdr.c | 2 +-
> > fs/nfsd/nfsxdr.c | 4 ++++
> > fs/readdir.c | 9 +++++++++
> > 5 files changed, 28 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/compat.c b/fs/compat.c
> > index 6d6f98f..43f6102 100644
> > --- a/fs/compat.c
> > +++ b/fs/compat.c
> > @@ -847,6 +847,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
> > struct compat_old_linux_dirent __user *dirent;
> > compat_ulong_t d_ino;
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > if (buf->result)
> > return -EINVAL;
> > d_ino = ino;
> > @@ -918,6 +921,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
> > compat_ulong_t d_ino;
> > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > buf->error = -EINVAL; /* only used if we fail.. */
> > if (reclen > buf->count)
> > return -EINVAL;
> > @@ -1007,6 +1013,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
> > int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
> > u64 off;
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > buf->error = -EINVAL; /* only used if we fail.. */
> > if (reclen > buf->count)
> > return -EINVAL;
> > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> > index 01d4ec1..59576d0 100644
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -884,6 +884,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
> > int elen; /* estimated entry length in words */
> > int num_entry_words = 0; /* actual number of words */
> >
> > + if (d_type == DT_WHT) {
> > + cd->common.err = nfs_ok;
> > + return 0;
> > + }
> > +
> > if (cd->offset) {
> > u64 offset64 = offset;
> >
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 2dcc7fe..8c25012 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -2263,7 +2263,7 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
> > __be32 nfserr = nfserr_toosmall;
> >
> > /* In nfsv4, "." and ".." never make it onto the wire.. */
> > - if (name && isdotent(name, namlen)) {
> > + if (d_type == DT_WHT || (name && isdotent(name, namlen))) {
>
> Optimization: I would swap the order of the two conditions separated by the
> '||': the right-hand-side condition is far more likely to occur than
> d_type==DT_WHT, so you can enter the body of the 'if' more quickly for the
> common case.
>
> > cd->common.err = nfs_ok;
> > return 0;
> > }
> > diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> > index afd08e2..a7d622c 100644
> > --- a/fs/nfsd/nfsxdr.c
> > +++ b/fs/nfsd/nfsxdr.c
> > @@ -513,6 +513,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
> > namlen, name, offset, ino);
> > */
> >
> > + if (d_type == DT_WHT) {
> > + cd->common.err = nfs_ok;
> > + return 0;
> > + }
> > if (offset > ~((u32) 0)) {
> > cd->common.err = nfserr_fbig;
> > return -EINVAL;
> > diff --git a/fs/readdir.c b/fs/readdir.c
> > index 7723401..3a48491 100644
> > --- a/fs/readdir.c
> > +++ b/fs/readdir.c
> > @@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> > struct old_linux_dirent __user * dirent;
> > unsigned long d_ino;
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > if (buf->result)
> > return -EINVAL;
> > d_ino = ino;
> > @@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
> > unsigned long d_ino;
> > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > buf->error = -EINVAL; /* only used if we fail.. */
> > if (reclen > buf->count)
> > return -EINVAL;
> > @@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
> > struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
> > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
> >
> > + if (d_type == DT_WHT)
> > + return 0;
> > +
> > buf->error = -EINVAL; /* only used if we fail.. */
> > if (reclen > buf->count)
> > return -EINVAL;
> > --
> > 1.6.3.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> Erez.

2010-01-21 00:36:22

by Valerie Aurora

Subject: Re: [PATCH 10/41] whiteout: Add vfs_whiteout() and whiteout inode operation

On Sun, Nov 29, 2009 at 10:04:16PM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > Simply white-out a given directory entry. This functionality is usually used
> > in the sense of unlink. Therefore the given dentry can still be in-use and
> > contains an in-use inode. The filesystems inode operation has to do what
> > unlink or rmdir would in that case. Since the dentry still might be in-use
> > we have to provide a fresh unhashed dentry that is used as the whiteout
> > dentry instead. The given dentry is dropped and the whiteout dentry is
> > rehashed instead.
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: David Woodhouse <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/dcache.c | 4 +-
> > fs/namei.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dcache.h | 6 +++
> > include/linux/fs.h | 3 +
> > 4 files changed, 116 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 3415e9e..0fcae4b 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1076,8 +1076,10 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
> > /* the caller must hold dcache_lock */
> > static void __d_instantiate(struct dentry *dentry, struct inode *inode)
> > {
> > - if (inode)
> > + if (inode) {
> > + dentry->d_flags &= ~DCACHE_WHITEOUT;
> > list_add(&dentry->d_alias, &inode->i_dentry);
> > + }
> > dentry->d_inode = inode;
> > fsnotify_d_instantiate(dentry, inode);
> > }
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 46cf1cb..d2fc8c9 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2169,6 +2169,110 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
> > return sys_mkdirat(AT_FDCWD, pathname, mode);
> > }
> >
> > +
> > +/* Checks on the victim for whiteout */
> > +static inline int may_whiteout(struct inode *dir, struct dentry *victim,
> > + int isdir)
>
> Why not make 'isdir' a boolean?
>
> I'd prefer to see more documentation above this function: explain what each
> arg does, return value, etc.

Indeed! I added descriptions of the new whiteout and fallthru VFS ops
to Documentation/filesystems/vfs.txt, where it belongs.

> > +/**
> > + * vfs_whiteout: creates a white-out for the given directory entry
> > + * @dir: parent inode
> > + * @dentry: directory entry to white-out
>
> Nit: is it 'white-out' or 'whiteout'? Whatever you choose is fine, but
> please use consistent hyphenation/spelling everywhere (code, comments, and
> documentation).

Done, thanks.

> > + *
> > + * Simply white-out a given directory entry. This functionality is usually used
> > + * in the sense of unlink. Therefore the given dentry can still be in-use and
> > + * contains an in-use inode. The filesystem has to do what unlink or rmdir
>
> Nit: other than the line of comment just above, the other two instances of
> "in-use" in this comment (and the patch header) should be changed to "in
> use" (no hyphen).
>
> > + * would in that case. Since the dentry still might be in-use we have to
> > + * provide a fresh unhashed dentry that whiteout can fill the new inode into.
> > + * In that case the given dentry is dropped and the fresh dentry containing the
> > + * whiteout is rehashed instead. If the given dentry is unused, the whiteout
> > + * inode is instantiated into it instead.
> > + *
> > + * After this returns with success, don't make any assumptions about the inode.
>
> What kinds of assumptions should one not make? Perhaps it'd be better to
> document what you can/should assume, instead of what you shouldn't (or
> both?)
>
> > + * Just dput() it dentry.
>
> The last line is awkward: do you mean "its dentry"?

All rewritten to address the above comments.

Thanks,

-VAL

2010-01-21 00:53:35

by Valerie Aurora

Subject: Re: [PATCH 12/41] union-mount: Allow removal of a directory

On Mon, Nov 30, 2009 at 01:13:36AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > do_whiteout() allows removal of a directory when it has whiteouts but
> > is logically empty.
> >
> > XXX - This patch abuses readdir() to check if the union directory is
> > logically empty - that is, all the entries are whiteouts (or "." or
> > ".."). Currently, we have no clean VFS interface to ask the lower
> > file system if a directory is empty.
> >
> > Fixes:
> > - Add ->is_directory_empty() op
> > - Add is_directory_empty flag to dentry (ugly dcache populate)
> > - Ask underlying fs to remove it and look for an error return
> > - (your idea here)
>
> Yeah, this is a difficult issue. I think the best way would be to
>
> 1. add an OPTIONAL ->is_directory_empty() inode op.
>
> 2. have the VFS use some default/generic behavior ala filldir_is_empty()
> below if inode->i_op->is_directory_empty is NULL. I assume this behavior
> will only need to be checked for file systems that support whiteouts in
> the first place.
>
> This'll provide some working behavior for all whiteout-supporting file
> systems, but allow anyone who wants to develop a more efficient method to
> provide one.

I hear you, but I'm reluctant to keep a generic version of
is_directory_empty() because (1) you have to add support for
whiteouts and fallthrus anyway, so you might as well require an
is_directory_empty() op at the same time, (2) per-fs versions would
undoubtedly be more efficient than bouncing up and down through
readdir(), and (3) it's such an abuse. :)
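
For reference, the "logically empty" test that the readdir()-based check performs (every entry is ".", "..", or a whiteout) can be sketched in userspace like this. The struct, the SKETCH_* constants, and the function names are hypothetical stand-ins. The numeric values mirror DT_REG and DT_WHT from dirent.h, and this is not the filldir_is_empty() code from the patch:

```c
#include <string.h>

#define SKETCH_DT_REG 8		/* same value as DT_REG */
#define SKETCH_DT_WHT 14	/* same value as DT_WHT */

struct sketch_dirent {
	const char *name;
	unsigned int d_type;
};

/* Returns 1 if the entry does not count against emptiness. */
static int entry_is_ignorable(const struct sketch_dirent *e)
{
	if (e->d_type == SKETCH_DT_WHT)
		return 1;
	return strcmp(e->name, ".") == 0 || strcmp(e->name, "..") == 0;
}

/* Returns 1 if every entry is ".", "..", or a whiteout; 0 otherwise. */
int directory_is_logically_empty(const struct sketch_dirent *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!entry_is_ignorable(&entries[i]))
			return 0;
	return 1;
}
```

A per-fs op could answer the same question from its on-disk directory format directly, without walking every entry through a callback.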

> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 85 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 5da1635..9a62c75 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2284,6 +2284,91 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> > }
> >
> > /*
> > + * This is abusing readdir to check if a union directory is logically empty.
> > + * Al Viro barfed when he saw this, but Val said: "Well, at this point I'm
> > + * aiming for working, pretty can come later"
> > + */
> > +static int filldir_is_empty(void *__buf, const char *name, int namlen,
> > + loff_t offset, u64 ino, unsigned int d_type)
> > +{
>
> Why not make filldir_is_empty() return a bool? That explains more clearly
> the function's return code.
>
> > +static int directory_is_empty(struct dentry *dentry, struct vfsmount *mnt)
> > +{
>
> This can also return a bool.
>
> > +static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
> > +{
>
> 'isdir' can be bool.

In general, I'm not using bools because it doesn't fit in with the
coding style of the rest of the VFS.

> > + struct path safe = { .dentry = dget(nd->path.dentry),
> > + .mnt = mntget(nd->path.mnt) };
> > + struct dentry *dentry = path->dentry;
> > + int err;
>
> You might want to move the initialization of 'struct path safe' down below,
> and add a BUG_ON(!nd) before that. I think during the development phases of
> UM, it's a good idea to have a few more debugging BUG_ON's.

I'd rather get rid of the need for struct path safe entirely...

-VAL

2010-01-21 02:03:35

by Valerie Aurora

Subject: Re: [PATCH 13/41] whiteout: tmpfs whiteout support

On Mon, Nov 30, 2009 at 01:26:53AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > Add support for whiteout dentries to tmpfs.
>
> Shouldn't you CC Hugh Dickins here? He's probably best positioned to review
> the changes in mm/shmem.c.

Thanks, I added him and linux-mm.

> > XXX - Not sure this is the right patch to put the code for supporting
> > whiteouts in d_genocide().
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: David Woodhouse <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/dcache.c | 3 +-
> > mm/shmem.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> > 2 files changed, 137 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 0fcae4b..1fae1df 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -2280,7 +2280,8 @@ resume:
> > struct list_head *tmp = next;
> > struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> > next = tmp->next;
> > - if (d_unhashed(dentry)||!dentry->d_inode)
> > + if (d_unhashed(dentry)||(!dentry->d_inode &&
> > + !d_is_whiteout(dentry)))
>
> I think this d_genocide patch should go elsewhere. What does it have to do
> with tmpfs?

Without this patch, you can't unmount a tmpfs file system with
whiteouts. d_genocide() is called by kill_litter_super() to evict all
the dcache entries used by tmpfs.

> Also, is your logic above correct? If I understood d_genocide correctly,
> then the code you changed attempts to skip over dentries for which
> d_genocide has no work to do, like unhashed and negative dentries. So I
> assume it should also skip over whiteout dentries. Your condition is
>
> if (d_unhashed(dentry) || (!dentry->d_inode && !d_is_whiteout(dentry)))
>
> but perhaps it needs to be
>
> if (d_unhashed(dentry) || !dentry->d_inode || d_is_whiteout(dentry))
>
> No?
>
> Either way, you may want to document any complex conditional that may be
> confusing to parse.

This is a good thing to document. What we're dealing with here is
dropping the ref count on persistent dentries. How about this comment?

/*
* Skip unhashed and negative dentries, but process
* positive dentries and whiteouts. A whiteout looks
* kind of like a negative dentry for purposes of
* lookup, but it has an extra pinning ref count
* because it can't be evicted like a negative dentry
* can. What we care about here is ref counts - and
* we need to drop the ref count on a whiteout before
* we can evict it.
*/
if (d_unhashed(dentry)||(!dentry->d_inode &&
!d_is_whiteout(dentry)))
continue;
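
To make the difference between the two conditions concrete, here is a small userspace model of both predicates. The struct and function names are hypothetical, and the three int fields stand in for d_unhashed(), dentry->d_inode, and d_is_whiteout():

```c
struct sketch_dentry {
	int unhashed;
	int has_inode;
	int is_whiteout;
};

/* Condition from the patch: skip unhashed dentries and plain negatives. */
int skip_as_posted(const struct sketch_dentry *d)
{
	return d->unhashed || (!d->has_inode && !d->is_whiteout);
}

/* Alternative raised in review: this would also skip whiteouts. */
int skip_as_suggested(const struct sketch_dentry *d)
{
	return d->unhashed || !d->has_inode || d->is_whiteout;
}
```

The two only disagree for a hashed whiteout: the posted condition processes it, so its pinning ref count gets dropped, while the suggested one would skip it.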

-VAL

2010-01-26 19:52:42

by Valerie Aurora

Subject: Re: [PATCH 16/41] whiteout: jffs2 whiteout support

On Mon, Nov 30, 2009 at 02:51:05AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Felix Fietkau <[email protected]>
> >
> > Add support for whiteout dentries to jffs2.
> >
> > Signed-off-by: Felix Fietkau <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > Cc: David Woodhouse <[email protected]>
> > Cc: [email protected]
> > ---
> > fs/jffs2/dir.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++-
> > fs/jffs2/fs.c | 4 ++
> > fs/jffs2/super.c | 2 +-
> > include/linux/jffs2.h | 2 +
> > 4 files changed, 82 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> > index 6f60cc9..46a2e1b 100644
> > --- a/fs/jffs2/dir.c
> > +++ b/fs/jffs2/dir.c
> > @@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
> > static int jffs2_rename (struct inode *, struct dentry *,
> > struct inode *, struct dentry *);
> >
> > +static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
> > +
> > const struct file_operations jffs2_dir_operations =
> > {
> > .read = generic_read_dir,
> > @@ -55,6 +57,7 @@ const struct inode_operations jffs2_dir_inode_operations =
> > .rmdir = jffs2_rmdir,
> > .mknod = jffs2_mknod,
> > .rename = jffs2_rename,
> > + .whiteout = jffs2_whiteout,
> > .permission = jffs2_permission,
> > .setattr = jffs2_setattr,
> > .setxattr = jffs2_setxattr,
> > @@ -98,8 +101,18 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
> > fd = fd_list;
> > }
> > }
> > - if (fd)
> > - ino = fd->ino;
> > + if (fd) {
> > + spin_lock(&target->d_lock);
> > + switch(fd->type) {
> > + case DT_WHT:
> > + target->d_flags |= DCACHE_WHITEOUT;
> > + break;
> > + default:
> > + ino = fd->ino;
> > + break;
> > + }
> > + spin_unlock(&target->d_lock);
> > + }
>
> The switch statement above should be simplified into this:
>
> if (fd->type == DT_WHT)
> target->d_flags |= DCACHE_WHITEOUT;
> else
> ino = fd->ino;

The switch is there because a later patch adds a third case for
fallthrus, at which point a switch statement is easier to read. But it
is confusing and distracting by itself in this patch, so I changed it
as you suggested.


> > + /* If it's a directory, then check whether it is really empty
> > + */
>
> Format above comment on one line.

Fixed, thanks.

-VAL

2010-01-26 20:02:49

by Valerie Aurora

Subject: Re: [PATCH 17/41] whiteout: Add path_whiteout() helper

On Mon, Nov 30, 2009 at 02:57:30AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > Add a path_whiteout() helper for vfs_whiteout().
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/namei.c | 15 ++++++++++++++-
> > include/linux/fs.h | 1 -
> > 2 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 9a62c75..408380d 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2231,7 +2231,7 @@ static inline int may_whiteout(struct inode *dir, struct dentry *victim,
> > * After this returns with success, don't make any assumptions about the inode.
> > * Just dput() it dentry.
> > */
> > -int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> > +static int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
>
> Didn't some other patch introduce vfs_whiteout? So why have a second patch
> which makes vfs_whiteout a static? Why not introduce both vfs_whiteout and
> path_whiteout in one patch?

You're right, I merged those patches.

> > {
> > int err;
> > struct inode *old_inode = dentry->d_inode;
> > @@ -2283,6 +2283,19 @@ int vfs_whiteout(struct inode *dir, struct dentry *dentry, int isdir)
> > return err;
> > }
> >
> > +int path_whiteout(struct path *dir_path, struct dentry *dentry, int isdir)
>
> Please document the behavior of path_whiteout in a proper comment above it
> (kernel-doc). Describe return values, side effects, etc.

Superseded by improved vfs_whiteout() documentation.

Thanks,

-VAL

2010-01-26 20:03:59

by Valerie Aurora

Subject: Re: [PATCH 19/41] union-mount: Introduce MNT_UNION and MS_UNION flags

On Mon, Nov 30, 2009 at 03:02:44AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > Add per mountpoint flag for Union Mount support. You need additional patches
> > to util-linux for that to work - see:
> >
> > git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Valerie Aurora <[email protected]>
> > ---
> > fs/namespace.c | 5 ++++-
> > include/linux/fs.h | 1 +
> > include/linux/mount.h | 1 +
> > 3 files changed, 6 insertions(+), 1 deletions(-)
> [...]
>
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 5d52753..e175c47 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -35,6 +35,7 @@ struct mnt_namespace;
> > #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
> > #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
> > #define MNT_PNODE_MASK 0x3000 /* propagation flag mask */
> > +#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */
>
> Is it correct to just add another flag here? How does it relate to this
> 'propagation mask' right above it? If there's some code out there which
> masks out which MNT flags get propagated and which don't, then you need to
> make a decision whether MNT_UNION needs to be propagated as well. Either
> way, please document your decision in a comment here so no one will have to
> ask the same question again.

I sat down and puzzled this out and sent a separate patch to clean up
and comment this part of the code.
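
As a quick check of the bit arithmetic only, using the flag values from the quoted hunk: MNT_UNION does not overlap MNT_PNODE_MASK, so masking propagation bits in or out leaves it untouched. The two helper functions are hypothetical, and the sketch says nothing about whether MNT_UNION ought to be propagated:

```c
/* Values copied from the quoted include/linux/mount.h hunk. */
#define MNT_SHARED	0x1000
#define MNT_UNBINDABLE	0x2000
#define MNT_PNODE_MASK	0x3000
#define MNT_UNION	0x4000

/* Returns 1 if any bit of flag lies inside the propagation mask. */
int overlaps_pnode_mask(int flag)
{
	return (flag & MNT_PNODE_MASK) != 0;
}

/* Clears the propagation bits and leaves every other flag as it was. */
int clear_propagation(int flags)
{
	return flags & ~MNT_PNODE_MASK;
}
```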

-VAL

2010-01-26 22:38:37

by Valerie Aurora

Subject: Re: [PATCH 20/41] union-mount: Introduce union_mount structure

On Mon, Nov 30, 2009 at 03:46:29AM -0500, Erez Zadok wrote:
> In message <[email protected]>, Valerie Aurora writes:
> > From: Jan Blunck <[email protected]>
> >
> > This patch adds the basic structures of VFS based union mounts. It is a new
> > implementation based on some of my old ideas that influenced Bharata B Rao
> > <[email protected]> who came up with the proposal to let the
> > union_mount struct only point to the next layer in the union stack. I rewrote
> > nearly all of the central patches around lookup and the dcache interaction.
> >
> > Advantages of the new implementation:
> > - the new union stack is no longer tied directly to one dentry
> > - the union stack enables dentries to be part of more than one union
> > (bind mounts)
> > - it is unnecessary to traverse the union stack when de/referencing a dentry
> > - caching of union stack information still driven by dentry cache
> >
> > XXX - is_unionized() is pretty heavy-weight for non-union file systems
> > on a union mount-enabled kernel. May be simplified by assuming one or
> > more of:
> >
> > - Two layers only
> > - One-to-one association between layers (doesn't union submounts)
> > - Writable layer mounted in only one place
>
> Yes, is_unionized() does appear to be heavy. Is it correct to assume that
> every such dentry will have gotten looked up or traversed as part of a
> union? If so, can we just set a flag in the dentry to mark it as
> D_THIS_IS_PART_OF_A_UNION? Even if you could, what happens when a union r-w
> layer is removed: could there be leftover dentries marked as part of a
> union, which are no longer really part of it?

dentries aren't themselves part of a unioned file system - they are
shared among all the mounts of a superblock/device. A dentry is
unioned or not only in the context of a particular mount. So we can't
mark the dentry itself since it will be unioned in one mount (say,
/mnt/union) and not in another (/mnt/ro). A dentry can be looked up
in another mount of one of the components of the union before the
union mount is created, so we can't mark a dentry at lookup anyway.

The vfsmount for the top layer <dentry,mnt> pair is marked with
MNT_UNION, but the vfsmount for the bottom layer is not. I think we
could mark the lower vfsmount with a flag (probably not MNT_UNION but
MNT_UNION_LOWER or something like that) and check for that. Al? Jan?
Christoph?
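
A minimal userspace model of the point above: the same dentry reached through two different mounts is "unioned" in one and not in the other, so the test has to look at the mount half of the pair. The sketch_* types and path_is_union_top() are hypothetical. Only the MNT_UNION value comes from the patch:

```c
#define MNT_UNION 0x4000	/* value from the quoted mount.h hunk */

struct sketch_mount {
	int mnt_flags;
};

struct sketch_dentry {
	const char *name;
};

/* A <dentry,mnt> pair, in the spirit of struct path. */
struct sketch_path {
	struct sketch_mount *mnt;
	struct sketch_dentry *dentry;
};

/* Returns 1 if the pair is reached through a mount marked MNT_UNION. */
int path_is_union_top(const struct sketch_path *p)
{
	return (p->mnt->mnt_flags & MNT_UNION) != 0;
}
```

A flag stored in the dentry could not give different answers for the two paths, since both point at the same dentry. The model only covers the top layer, matching the current marking of the top-layer vfsmount.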

-VAL