2010-04-15 23:14:26

by Valerie Aurora

Subject: [PATCH 04/35] whiteout/NFSD: Don't return information about whiteouts to userspace

From: Jan Blunck <[email protected]>

Userspace isn't ready for handling another file type, so silently drop
whiteout directory entries before they leave the kernel.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: [email protected]
Cc: "J. Bruce Fields" <[email protected]>
Cc: Neil Brown <[email protected]>
---
fs/compat.c | 9 +++++++++
fs/nfsd/nfs3xdr.c | 5 +++++
fs/nfsd/nfs4xdr.c | 5 +++++
fs/nfsd/nfsxdr.c | 4 ++++
fs/readdir.c | 9 +++++++++
5 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 00d90c2..624e1a5 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -838,6 +838,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
struct compat_old_linux_dirent __user *dirent;
compat_ulong_t d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -909,6 +912,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
compat_ulong_t d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -998,6 +1004,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
u64 off;

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2a533a0..9b96f5a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
int elen; /* estimated entry length in words */
int num_entry_words = 0; /* actual number of words */

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset) {
u64 offset64 = offset;

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 78c7e24..8839ba8 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2268,6 +2268,11 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
return 0;
}

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset)
xdr_encode_hyper(cd->offset, (u64) offset);

diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 4ce005d..0e57d4b 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -503,6 +503,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
namlen, name, offset, ino);
*/

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
if (offset > ~((u32) 0)) {
cd->common.err = nfserr_fbig;
return -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..3a48491 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
struct old_linux_dirent __user * dirent;
unsigned long d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
unsigned long d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
--
1.6.3.3
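
The filtering that this patch adds to every filldir-style callback can be sketched in userspace as follows. This is a minimal model, not kernel code: `sketch_buf` and `sketch_filldir` are hypothetical names, and the `SKETCH_DT_*` constants are local copies of the usual `<dirent.h>` values. The point it illustrates is that a whiteout is dropped while the callback still returns 0, so the directory walk continues with the next entry.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the usual <dirent.h> values, so the sketch does not
 * depend on the host libc exposing DT_WHT. */
#define SKETCH_DT_REG 8
#define SKETCH_DT_WHT 14

/* Hypothetical stand-in for the kernel's getdents callback buffer. */
struct sketch_buf {
    const char *names[16];
    size_t count;
};

/* Same contract as a filldir callback: 0 means "keep iterating",
 * a negative value stops the walk. */
static int sketch_filldir(struct sketch_buf *buf, const char *name,
                          unsigned int d_type)
{
    assert(buf != NULL);

    /* Whiteouts never reach the caller, but they are not an error. */
    if (d_type == SKETCH_DT_WHT)
        return 0;

    if (buf->count == sizeof(buf->names) / sizeof(buf->names[0]))
        return -1; /* buffer full: stop the walk */

    buf->names[buf->count++] = name;
    return 0;
}
```

Returning 0 rather than an error matters here: an error return would end the readdir early and hide the entries that follow the whiteout.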

2010-04-15 23:05:57

by Valerie Aurora

Subject: [PATCH 06/35] whiteout: Set S_OPAQUE inode flag when creating directories

From: Jan Blunck <[email protected]>

In a union directory, we don't want directories on lower layers of the
union to "show through". To prevent the contents of underlying directories
from magically showing up after a mkdir(), set the S_OPAQUE flag when a
directory is created where a whiteout existed before.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 11 ++++++++++-
include/linux/fs.h | 3 +++
2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 010927b..956083a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2104,6 +2104,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int error = may_create(dir, dentry);
+ int opaque = 0;

if (error)
return error;
@@ -2116,9 +2117,17 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
if (error)
return error;

+ if (d_is_whiteout(dentry))
+ opaque = 1;
+
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
+ if (!error) {
fsnotify_mkdir(dir, dentry);
+ if (opaque) {
+ dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(dentry->d_inode);
+ }
+ }
return error;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21102f9..a9f747c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_OPAQUE 1024 /* Directory is opaque */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -271,6 +272,8 @@ struct inodes_stat_t {
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)

+#define IS_OPAQUE(inode) ((inode)->i_flags & S_OPAQUE)
+
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */

--
1.6.3.3
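
The ordering in vfs_mkdir() above (sample the whiteout state first, create the directory, then set the flag) can be modeled with a small userspace sketch. All `sketch_` names are hypothetical; only the bit value 1024 is taken from the patch. The sketch assumes, as the tmpfs patch later in this series notes, that instantiating the dentry clears its whiteout state, which is why the state has to be read before the directory is created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_S_OPAQUE 1024 /* same bit value the patch gives S_OPAQUE */

struct sketch_inode {
    unsigned int i_flags;
};

struct sketch_dentry {
    bool is_whiteout;
    struct sketch_inode *inode; /* NULL while the dentry is negative */
};

/* Create a directory at `dentry`, backed by the caller-provided inode. */
static void sketch_mkdir(struct sketch_dentry *dentry,
                         struct sketch_inode *fresh)
{
    /* Must be sampled before instantiation clears the whiteout state. */
    bool opaque = dentry->is_whiteout;

    assert(dentry->inode == NULL);
    fresh->i_flags = 0;
    dentry->inode = fresh;
    dentry->is_whiteout = false;

    /* A directory that replaces a whiteout must hide the lower layer. */
    if (opaque)
        dentry->inode->i_flags |= SKETCH_S_OPAQUE;
}
```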

2010-04-15 23:06:08

by Valerie Aurora

Subject: [PATCH 07/35] whiteout: Allow removal of a directory with whiteouts

From: Jan Blunck <[email protected]>

do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.

XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
".."). Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.

Possible fixes:
- Add ->is_directory_empty() op
- Add is_directory_empty flag to dentry (ugly dcache populate)
- Ask underlying fs to remove it and look for an error return
- (your idea here)

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 956083a..991767b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2307,6 +2307,94 @@ int path_whiteout(struct path *dir_path, struct dentry *dentry, int isdir)
EXPORT_SYMBOL(path_whiteout);

/*
+ * XXX - We are abusing readdir to check if a union directory is
+ * logically empty.
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ int *is_empty = (int *)__buf;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ (*is_empty) = 0;
+ return 0;
+}
+
+static int directory_is_empty(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct file *file;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+ /* references for the file pointer */
+ dget(dentry);
+ mntget(mnt);
+
+ file = dentry_open(dentry, mnt, O_RDONLY, current_cred());
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
+{
+ struct path safe = { .dentry = dget(nd->path.dentry),
+ .mnt = mntget(nd->path.mnt) };
+ struct dentry *dentry = path->dentry;
+ int err;
+
+ err = may_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+ if (err)
+ goto out;
+
+ err = -ENOENT;
+ if (!dentry->d_inode)
+ goto out;
+
+ err = -ENOTEMPTY;
+ if (isdir && !directory_is_empty(path->dentry, path->mnt))
+ goto out;
+
+ if (nd->path.dentry != dentry->d_parent) {
+ dentry = __lookup_hash(&path->dentry->d_name, nd->path.dentry,
+ nd);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out;
+
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
+
+ err = vfs_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+
+out:
+ path_put(&safe);
+ return err;
+}
+
+/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning
--
1.6.3.3
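
The "logically empty" test that filldir_is_empty() implements can be sketched in userspace as below. This is an illustration under assumptions: `sketch_dirent` and the `SKETCH_DT_*` constants are hypothetical stand-ins, and the sketch stops at the first real entry, whereas the patch keeps reading and only records the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DT_DIR 4
#define SKETCH_DT_REG 8
#define SKETCH_DT_WHT 14

struct sketch_dirent {
    const char *name;
    unsigned int d_type;
};

static bool sketch_is_dot_or_dotdot(const char *name)
{
    return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/* A directory is logically empty when every entry is ".", "..",
 * or a whiteout. */
static bool sketch_logically_empty(const struct sketch_dirent *ents, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (sketch_is_dot_or_dotdot(ents[i].name))
            continue;
        if (ents[i].d_type == SKETCH_DT_WHT)
            continue;
        return false; /* a real entry: rmdir must fail with ENOTEMPTY */
    }
    return true;
}
```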

2010-04-15 23:06:20

by Valerie Aurora

Subject: [PATCH 08/35] whiteout: tmpfs whiteout support

From: Jan Blunck <[email protected]>

Add support for whiteout dentries to tmpfs. This includes adding
support for whiteouts to d_genocide(), which is called to tear down
pinned tmpfs dentries. Whiteouts have to be persistent, so they have
a pinning extra ref count that needs to be dropped by d_genocide().

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
---
fs/dcache.c | 13 +++++-
mm/shmem.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 147 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265015d..3b0e525 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2229,7 +2229,18 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (d_unhashed(dentry)||!dentry->d_inode)
+ /*
+ * Skip unhashed and negative dentries, but process
+ * positive dentries and whiteouts. A whiteout looks
+ * kind of like a negative dentry for purposes of
+ * lookup, but it has an extra pinning ref count
+ * because it can't be evicted like a negative dentry
+ * can. What we care about here is ref counts - and
+ * we need to drop the ref count on a whiteout before
+ * we can evict it.
+ */
+ if (d_unhashed(dentry)||(!dentry->d_inode &&
+ !d_is_whiteout(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..c58ecf4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1805,6 +1805,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}

+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+ struct dentry *dentry;
+
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ /* This gives us a proper initialized negative dentry */
+ dentry = simple_lookup(dir, new_dentry, NULL);
+ if (dentry && IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ /*
+ * No ordinary (disk based) filesystem counts whiteouts as inodes;
+ * but each new link needs a new dentry, pinning lowmem, and
+ * tmpfs dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ if (old_dentry->d_inode) {
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ shmem_rmdir(dir, old_dentry);
+ else
+ shmem_unlink(dir, old_dentry);
+ }
+
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ /* Extra pinning count for the created dentry */
+ dget(new_dentry);
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode)
+{
+ if (d_is_whiteout(dentry)) {
+ /* Re-using an existing whiteout */
+ shmem_free_inode(dir->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ inode->i_flags |= S_OPAQUE;
+ } else {
+ /* New dentry */
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dget(dentry); /* Extra count - pin the dentry in core */
+ }
+ /* Will clear DCACHE_WHITEOUT flag */
+ d_instantiate(dentry, inode);
+
+}
/*
* File creation. Allocate an inode, and we're done..
*/
@@ -1838,10 +1908,10 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
if (S_ISDIR(mode))
inode->i_mode |= S_ISGID;
}
- dir->i_size += BOGO_DIRENT_SIZE;
+
+ shmem_d_instantiate(dir, dentry, inode);
+
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
}
@@ -1879,12 +1949,11 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
if (ret)
goto out;

- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
+
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
atomic_inc(&inode->i_count); /* New dentry reference */
- dget(dentry); /* Extra pinning count for the created dentry */
- d_instantiate(dentry, inode);
out:
return ret;
}
@@ -1893,21 +1962,61 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
- shmem_free_inode(inode->i_sb);
+ if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ shmem_free_inode(dir->i_sb);

+ if (inode) {
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ drop_nlink(inode);
+ }
dir->i_size -= BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
return 0;
}

+static void shmem_dir_unlink_whiteouts(struct inode *dir, struct dentry *dentry)
+{
+ if (!dentry->d_inode)
+ return;
+
+ /* Remove whiteouts from logically empty directory */
+ if (S_ISDIR(dentry->d_inode->i_mode) &&
+ dentry->d_inode->i_sb->s_flags & MS_WHITEOUT) {
+ struct dentry *child, *next;
+ LIST_HEAD(list);
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+ spin_lock(&child->d_lock);
+ if (d_is_whiteout(child)) {
+ __d_drop(child);
+ if (!list_empty(&child->d_lru)) {
+ list_del(&child->d_lru);
+ dentry_stat.nr_unused--;
+ }
+ list_add(&child->d_lru, &list);
+ }
+ spin_unlock(&child->d_lock);
+ }
+ spin_unlock(&dcache_lock);
+
+ list_for_each_entry_safe(child, next, &list, d_lru) {
+ spin_lock(&child->d_lock);
+ list_del_init(&child->d_lru);
+ spin_unlock(&child->d_lock);
+
+ shmem_unlink(dentry->d_inode, child);
+ }
+ }
+}
+
static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
if (!simple_empty(dentry))
return -ENOTEMPTY;

+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(dir, dentry);
drop_nlink(dentry->d_inode);
drop_nlink(dir);
return shmem_unlink(dir, dentry);
@@ -1916,7 +2025,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
/*
* The VFS layer already does all the dentry stuff for rename,
* we just have to decrement the usage count for the target if
- * it exists so that the VFS layer correctly free's it when it
+ * it exists so that the VFS layer correctly frees it when it
* gets overwritten.
*/
static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
@@ -1927,7 +2036,12 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
if (!simple_empty(new_dentry))
return -ENOTEMPTY;

+ if (d_is_whiteout(new_dentry))
+ shmem_unlink(new_dir, new_dentry);
+
if (new_dentry->d_inode) {
+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(new_dir, new_dentry);
(void) shmem_unlink(new_dir, new_dentry);
if (they_are_dirs)
drop_nlink(old_dir);
@@ -1992,12 +2106,12 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s
unlock_page(page);
page_cache_release(page);
}
+
+ shmem_d_instantiate(dir, dentry, inode);
+
if (dir->i_mode & S_ISGID)
inode->i_gid = dir->i_gid;
- dir->i_size += BOGO_DIRENT_SIZE;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry);
return 0;
}

@@ -2375,6 +2489,12 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
if (!root)
goto failed_iput;
sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+ if (!(sb->s_flags & MS_NOUSER))
+ sb->s_flags |= MS_WHITEOUT;
+#endif
+
return 0;

failed_iput:
@@ -2475,6 +2595,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.rmdir = shmem_rmdir,
.mknod = shmem_mknod,
.rename = shmem_rename,
+ .whiteout = shmem_whiteout,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3
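
The inode accounting in shmem_whiteout() and shmem_unlink() above charges each whiteout against the superblock's inode limit, even though a whiteout has no inode. A minimal userspace model of that reserve/release pairing is sketched below. The `sketch_` names are hypothetical, and the locking that the patch does with stat_lock is deliberately left out, so this is only valid single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two shmem_sb_info fields involved. */
struct sketch_sbinfo {
    unsigned long max_inodes;  /* 0 means "no limit configured" */
    unsigned long free_inodes;
};

/* Take one slot for a new whiteout. Returns false where the patch
 * returns -ENOSPC. */
static bool sketch_reserve_slot(struct sketch_sbinfo *sb)
{
    assert(sb->free_inodes <= sb->max_inodes || sb->max_inodes == 0);

    if (!sb->max_inodes)
        return true;
    if (!sb->free_inodes)
        return false;
    sb->free_inodes--;
    return true;
}

/* Give the slot back when the whiteout is unlinked or reused. */
static void sketch_release_slot(struct sketch_sbinfo *sb)
{
    if (sb->max_inodes)
        sb->free_inodes++;
}
```

The release side is why shmem_d_instantiate() frees a slot when it reuses an existing whiteout: the new inode has already been charged, so the whiteout's own charge would otherwise be counted twice.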

2010-04-15 23:06:27

by Valerie Aurora

Subject: [PATCH 09/35] whiteout: Split off ext2_append_entry() from ext2_add_link()

From: Jan Blunck <[email protected]>

ext2_append_entry() is used later to find or append the directory
entry that is turned into a whiteout.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 70 ++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..57207a9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
}

/*
- * Parent is locked.
+ * Find or append a given dentry to the parent directory
*/
-int ext2_add_link (struct dentry *dentry, struct inode *inode)
+static ext2_dirent * ext2_append_entry(struct dentry * dentry,
+ struct page ** page)
{
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
@@ -482,13 +483,10 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
unsigned chunk_size = ext2_chunk_size(dir);
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned short rec_len, name_len;
- struct page *page = NULL;
- ext2_dirent * de;
+ ext2_dirent * de = NULL;
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
- loff_t pos;
- int err;

/*
* We take care of directory expansion in the same loop.
@@ -498,20 +496,19 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
for (n = 0; n <= npages; n++) {
char *dir_end;

- page = ext2_get_page(dir, n, 0);
- err = PTR_ERR(page);
- if (IS_ERR(page))
+ *page = ext2_get_page(dir, n, 0);
+ de = ERR_PTR(PTR_ERR(*page));
+ if (IS_ERR(*page))
goto out;
- lock_page(page);
- kaddr = page_address(page);
+ lock_page(*page);
+ kaddr = page_address(*page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
kaddr += PAGE_CACHE_SIZE - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
/* We hit i_size */
- name_len = 0;
- rec_len = chunk_size;
+ de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
@@ -519,12 +516,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
if (de->rec_len == 0) {
ext2_error(dir->i_sb, __func__,
"zero-length directory entry");
- err = -EIO;
+ de = ERR_PTR(-EIO);
goto out_unlock;
}
- err = -EEXIST;
if (ext2_match (namelen, name, de))
- goto out_unlock;
+ goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
@@ -533,13 +529,48 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
goto got_it;
de = (ext2_dirent *) ((char *) de + rec_len);
}
- unlock_page(page);
- ext2_put_page(page);
+ unlock_page(*page);
+ ext2_put_page(*page);
}
+
BUG();
- return -EINVAL;

got_it:
+ return de;
+ /* OFFSET_CACHE */
+out_unlock:
+ unlock_page(*page);
+ ext2_put_page(*page);
+out:
+ return de;
+}
+
+/*
+ * Parent is locked.
+ */
+int ext2_add_link (struct dentry *dentry, struct inode *inode)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+got_it:
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
pos = page_offset(page) +
(char*)de - (char*)page_address(page);
err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
@@ -563,7 +594,6 @@ got_it:
/* OFFSET_CACHE */
out_put:
ext2_put_page(page);
-out:
return err;
out_unlock:
unlock_page(page);
--
1.6.3.3
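
The split above relies on the kernel's ERR_PTR convention: the helper returns either a usable directory entry or an errno encoded in the pointer, and the caller decides whether an existing match is an error. A userspace sketch of that pattern follows. Every `sketch_` name is hypothetical, the fixed-size slot array stands in for ext2 directory pages, and unlike the ext2 loop the sketch scans all slots before falling back to a free one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers. */
static inline void *sketch_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long sketch_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_slot {
    char name[16];
    int in_use;
};

/* Return the slot already holding `name`, else a free slot,
 * else an encoded -ENOSPC. */
static struct sketch_slot *sketch_find_or_append(struct sketch_slot *slots,
                                                 size_t n, const char *name)
{
    struct sketch_slot *free_slot = NULL;

    for (size_t i = 0; i < n; i++) {
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return &slots[i];
        if (!slots[i].in_use && !free_slot)
            free_slot = &slots[i];
    }
    if (!free_slot)
        return sketch_err_ptr(-ENOSPC);
    return free_slot;
}

/* The "add link" policy: an existing match is an error for this caller. */
static int sketch_add_link(struct sketch_slot *slots, size_t n,
                           const char *name)
{
    struct sketch_slot *de = sketch_find_or_append(slots, n, name);

    if (sketch_is_err(de))
        return (int)sketch_ptr_err(de);
    if (de->in_use)
        return -EEXIST;

    strncpy(de->name, name, sizeof(de->name) - 1);
    de->name[sizeof(de->name) - 1] = '\0';
    de->in_use = 1;
    return 0;
}
```

A whiteout-creating caller could reuse the same helper and treat a match as the entry to overwrite instead of returning -EEXIST, which is the reason for moving that check out of the shared code.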

2010-04-15 23:06:42

by Valerie Aurora

Subject: [PATCH 15/35] fallthru: tmpfs fallthru support

Add support for fallthru directory entries to tmpfs.

XXX - Makes up inode number for dirent

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 3 +-
fs/libfs.c | 21 +++++++++++++++++--
mm/shmem.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b76f9e4..1575af4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2240,7 +2240,8 @@ resume:
* we can evict it.
*/
if (d_unhashed(dentry)||(!dentry->d_inode &&
- !d_is_whiteout(dentry)))
+ !d_is_whiteout(dentry) &&
+ !d_is_fallthru(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/fs/libfs.c b/fs/libfs.c
index 9e50bcf..cb24772 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -133,6 +133,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *cursor = filp->private_data;
struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
+ int d_type;
int i = filp->f_pos;

switch (i) {
@@ -158,14 +159,28 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (d_unhashed(next) || !next->d_inode)
+ if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
continue;

+ if (d_is_fallthru(next)) {
+ /* XXX We don't know the inode
+ * number of the directory
+ * entry in the underlying
+ * file system. Should look
+ * it up, either on fallthru
+ * creation at first readdir
+ * or now at filldir time. */
+ ino = 123; /* Made up ino */
+ d_type = DT_UNKNOWN;
+ } else {
+ ino = next->d_inode->i_ino;
+ d_type = dt_type(next->d_inode);
+ }
+
spin_unlock(&dcache_lock);
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
- next->d_inode->i_ino,
- dt_type(next->d_inode)) < 0)
+ ino, d_type) < 0)
return 0;
spin_lock(&dcache_lock);
/* next is still alive */
diff --git a/mm/shmem.c b/mm/shmem.c
index c58ecf4..163957b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1809,8 +1809,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
static int shmem_unlink(struct inode *dir, struct dentry *dentry);

/*
- * This is the whiteout support for tmpfs. It uses one singleton whiteout
- * inode per superblock thus it is very similar to shmem_link().
+ * Create a dentry to signify a whiteout.
*/
static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
@@ -1841,8 +1840,10 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
spin_unlock(&sbinfo->stat_lock);
}

- if (old_dentry->d_inode) {
- if (S_ISDIR(old_dentry->d_inode->i_mode))
+ if (old_dentry->d_inode || d_is_fallthru(old_dentry)) {
+ /* A fallthru for a dir is treated like a regular link */
+ if (old_dentry->d_inode &&
+ S_ISDIR(old_dentry->d_inode->i_mode))
shmem_rmdir(dir, old_dentry);
else
shmem_unlink(dir, old_dentry);
@@ -1859,6 +1860,48 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
}

static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode);
+
+/*
+ * Create a dentry to signify a fallthru. A fallthru in tmpfs is the
+ * logical equivalent of an in-kernel readdir() cache. It can't be
+ * deleted until the file system is unmounted.
+ */
+static int shmem_fallthru(struct inode *dir, struct dentry *dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+
+ /* FIXME: this is stupid */
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ if (dentry->d_inode || d_is_fallthru(dentry) || d_is_whiteout(dentry))
+ return -EEXIST;
+
+ /*
+ * Each new link needs a new dentry, pinning lowmem, and tmpfs
+ * dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ shmem_d_instantiate(dir, dentry, NULL);
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
if (d_is_whiteout(dentry)) {
@@ -1866,14 +1909,15 @@ static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
shmem_free_inode(dir->i_sb);
if (S_ISDIR(inode->i_mode))
inode->i_flags |= S_OPAQUE;
+ } else if (d_is_fallthru(dentry)) {
+ shmem_free_inode(dir->i_sb);
} else {
/* New dentry */
dir->i_size += BOGO_DIRENT_SIZE;
dget(dentry); /* Extra count - pin the dentry in core */
}
- /* Will clear DCACHE_WHITEOUT flag */
+ /* Will clear DCACHE_WHITEOUT and DCACHE_FALLTHRU flags */
d_instantiate(dentry, inode);
-
}
/*
* File creation. Allocate an inode, and we're done..
@@ -1962,7 +2006,8 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ if (d_is_whiteout(dentry) || d_is_fallthru(dentry) ||
+ (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
shmem_free_inode(dir->i_sb);

if (inode) {
@@ -2596,6 +2641,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.mknod = shmem_mknod,
.rename = shmem_rename,
.whiteout = shmem_whiteout,
+ .fallthru = shmem_fallthru,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3
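
The dcache_readdir() change above decides, per child dentry, whether to report it and which inode number and type to report. A userspace model of that decision is sketched below. The `sketch_` types are hypothetical, and the value 123 is the placeholder inode number the patch itself marks as made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_DT_UNKNOWN 0
#define SKETCH_DT_REG 8

struct sketch_inode {
    unsigned long i_ino;
    unsigned int d_type;
};

struct sketch_dentry {
    bool unhashed;
    bool is_fallthru;
    const struct sketch_inode *inode; /* NULL for negative dentries */
};

/* Returns false when the entry must be skipped; otherwise fills in the
 * values that would be handed to the filldir callback. */
static bool sketch_readdir_entry(const struct sketch_dentry *d,
                                 unsigned long *ino, unsigned int *d_type)
{
    assert(ino != NULL && d_type != NULL);

    /* Negative dentries are skipped unless they are fallthrus. */
    if (d->unhashed || (!d->inode && !d->is_fallthru))
        return false;

    if (d->is_fallthru) {
        *ino = 123; /* placeholder, as in the patch */
        *d_type = SKETCH_DT_UNKNOWN;
    } else {
        *ino = d->inode->i_ino;
        *d_type = d->inode->d_type;
    }
    return true;
}
```

Reporting DT_UNKNOWN is the conservative choice: callers that care about the type are expected to fall back to a stat call, which then resolves against the lower layer.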

2010-04-15 23:07:01

by Valerie Aurora

Subject: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

Document design and implementation of writable overlays (a.k.a. union
mounts).

XXX - out of date

Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/union-mounts.txt | 708 ++++++++++++++++++++++++++++
1 files changed, 708 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
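
The documentation added by this patch defines whiteouts, fallthrus, and opaque directories in terms of how a lookup travels from the top layer to the bottom layer. The rule for a single name can be sketched as below. This is a simplified model with hypothetical `sketch_` names; it ignores directory merging and copyup and only answers where one name resolves.

```c
#include <assert.h>
#include <stdbool.h>

/* What the top (read-write) layer holds for the name being looked up. */
enum sketch_top_entry {
    SKETCH_TOP_MISSING,
    SKETCH_TOP_POSITIVE,
    SKETCH_TOP_WHITEOUT,
    SKETCH_TOP_FALLTHRU,
};

enum sketch_result {
    SKETCH_FOUND_TOP,
    SKETCH_FOUND_BOTTOM,
    SKETCH_NOT_FOUND,
};

static enum sketch_result sketch_lookup(enum sketch_top_entry top,
                                        bool parent_opaque,
                                        bool exists_on_bottom)
{
    switch (top) {
    case SKETCH_TOP_POSITIVE:
        return SKETCH_FOUND_TOP;   /* top layer covers the bottom */
    case SKETCH_TOP_WHITEOUT:
        return SKETCH_NOT_FOUND;   /* lookup may not travel down */
    case SKETCH_TOP_FALLTHRU:
        /* A fallthru overrides the parent's opaque flag. */
        return exists_on_bottom ? SKETCH_FOUND_BOTTOM : SKETCH_NOT_FOUND;
    case SKETCH_TOP_MISSING:
        break;
    }

    /* No top-layer entry: an opaque parent blocks the bottom layer. */
    if (parent_opaque)
        return SKETCH_NOT_FOUND;
    return exists_on_bottom ? SKETCH_FOUND_BOTTOM : SKETCH_NOT_FOUND;
}
```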

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..5f47296
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,708 @@
+State of writable overlays (formerly union mounts)
+==================================================
+
+This version of union mounts is renamed "writable overlays." The goal
+of this patch set is to support a single read-write file system
+overlaid on a single read-only file system. "Union mounts" suggests
+that we support unions of arbitrary numbers and types of file systems,
+which is not the goal of this patch set.
+
+The most recent version of writable overlays can boot to multi-user
+mode with a writable overlay root file system. open(), truncate(),
+creat(), unlink(), mkdir(), rmdir(), and rename() work. link(),
+chmod(), chown(), and chattr() don't work yet.
+
+This document describes the architecture and current status of
+writable overlays, including an item-by-item todo list.
+
+Writable overlays (formerly union mounts)
+=========================================
+
+In this document:
+ - Overview of writable overlays
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to writable overlays
+
+Overview
+========
+
+Writable overlays (formerly known as union mounts) are used to layer a
+single writable file system over a single read-only file system, with
+all writes going to the writable file system. The namespace of both
+file systems appears as a combined whole to userland, with those on
+the writable file system covering up any matching pathnames on the
+read-only file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device; however, sharing at the file system level has higher
+performance and uses less disk space.
+
+What writable overlays are not
+------------------------------
+
+Writable overlays are not a general-purpose unioning file system.
+They do not provide a generic "union of namespaces" operation for an
+arbitrary number of file systems. Many interesting features can be
+implemented with a generic unioning facility: unioning of more than
+two file systems, dynamic insertion and removal of branches, online
+upgrade, etc. Some unioning file systems that do this are UnionFS and
+AUFS. Unfortunately, the complexity of these feature sets leads to
+difficult corner cases which so far have been unsolvable in the
+context of the Linux VFS.
+
+Writable overlays avoid these corner cases by reducing the feature set
+to the bare minimum of the most requested features: one writable file
+system layered over one read-only file system. Despite the limitations
+of writable overlays, the VFS infrastructure they use is generic enough
+to be reused by more full-featured unioning file systems.
+
+Terminology
+===========
+
+The main analogy for writable overlays is that a writable file system
+is mounted "on top" of a read-only file system. Lookups start at the
+"top" read-write file system and travel "down" to the "bottom"
+read-only file system only if no blocking entry exists on the top
+layer.
+
+Top layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer. Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque: A flag on a directory in the top layer that prevents lookups
+of entries in this directory from travelling down to the bottom
+layer (unless there is an explicit fallthru entry allowing that for a
+particular entry). Set on creation of a directory that replaces a
+whiteout, and after a directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry. This serves as a
+placeholder for directory entries from the bottom layer during
+readdir(). Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same properties
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory. Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
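
The terminology above can be read as a small decision table. The
following is a minimal userspace sketch of that table, not the kernel
code: the enum names and resolve_name() are hypothetical, and each
directory entry is reduced to a single state value.

```c
#include <stdbool.h>

/* Hypothetical model of one name inside a top-layer directory. */
enum top_entry {
	TOP_ABSENT,	/* no entry of any kind on the top layer */
	TOP_REAL,	/* ordinary file or directory on the top layer */
	TOP_WHITEOUT,	/* blocks the lookup from travelling down */
	TOP_FALLTHRU	/* explicitly lets the lookup fall through */
};

enum lookup_result {
	FOUND_ON_TOP,
	FOUND_ON_BOTTOM,
	NOT_FOUND
};

/*
 * Decide where a lookup for one name ends, given the state of the
 * top-layer entry, whether the containing top-layer directory is
 * opaque, and whether the bottom layer has an entry with that name.
 */
enum lookup_result resolve_name(enum top_entry top, bool dir_opaque,
				bool on_bottom)
{
	switch (top) {
	case TOP_REAL:
		return FOUND_ON_TOP;
	case TOP_WHITEOUT:
		return NOT_FOUND;
	case TOP_FALLTHRU:
		/* Fallthrus override the opaque flag. */
		return on_bottom ? FOUND_ON_BOTTOM : NOT_FOUND;
	case TOP_ABSENT:
		if (dir_opaque)
			return NOT_FOUND;
		return on_bottom ? FOUND_ON_BOTTOM : NOT_FOUND;
	}
	return NOT_FOUND;
}
```

For example, a name with a fallthru entry in an opaque directory still
resolves to the bottom layer, while a name with no top-layer entry in
the same directory does not.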
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on top layer
+- unlink() /oldfile -> creates a whiteout on top layer
+- Edit /existingfile -> copies up to top layer at open(O_WR) time
+- truncate /existingfile -> copies up to top layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to top layer
+- mkdir() /newdir -> creates on top layer
+- rmdir() /olddir -> creates a whiteout on top layer
+- mkdir() /olddir after above -> creates on top layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() dir -> EXDEV
+
+Getting to a root file system with a writable overlay:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the writable overlay on /newroot:
+ # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Writable overlays are implemented as an integral part of the VFS,
+rather than as a VFS client file system (i.e., a stacked file system
+like unionfs or ecryptfs). Implementing writable overlays inside the
+VFS eliminates the need for duplicate copies of VFS data structures,
+unnecessary indirection, and code duplication, but requires very
+maintainable, low-to-zero overhead code. Writable overlays require no
+change to file systems serving as the read-only layer, and require
+some minor support from file systems serving as the read-write layer.
+File systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+union_mount structure
+---------------------
+
+The primary data structure for writable overlays is the union_mount
+structure, which connects overlapping directory dentries into a "union
+stack":
+
+struct union_mount {
+ atomic_t u_count; /* reference count */
+ struct mutex u_mutex;
+ struct list_head u_unions; /* list head for d_unions */
+ struct list_head u_list; /* list head for mnt_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */
+
+ struct path u_this; /* this is me */
+ struct path u_next; /* this is what I overlay */
+};
+
+The union_mount is referenced from the corresponding directory's
+dentry:
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * The following fields are used by the VFS based union mount
+ * implementation. Both are protected by union_lock!
+ */
+ struct list_head d_unions; /* list of union_mounts */
+ unsigned int d_unionized; /* unions referencing this dentry */
+#endif
+[...]
+};
+
+Each top layer directory with the potential for a lookup to fall
+through to the bottom layer has a union_mount structure stored in a
+union_mount hash table. A union_mount can be looked up both by the
+top layer's path (via union_lookup()) and the bottom layer's path (via
+union_rlookup()). Once you have the path (vfsmount and dentry pair)
+of a file, the union stack can be followed down, layer by layer, with
+follow_union_down(), and up with follow_union_mount().
+
+All union_mount structures are allocated from a kmem cache. A
+union_mount is allocated when the first referencing dentry is
+allocated and freed when all of the referencing dentries are freed -
+that is, the dcache drives the union cache. While writable overlays
+only use two layers, the union stack infrastructure is capable of
+supporting an arbitrary number of file system layers (leaving aside
+locking issues).
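
Following a union stack down, layer by layer, can be sketched in
userspace as below. This is only an illustration under simplifying
assumptions: paths are pairs of integers, the "hash table" is a plain
array that is scanned linearly, there is no locking or reference
counting, and the toy_* names are hypothetical rather than the
kernel's follow_union_down().

```c
#include <stddef.h>

/* Stand-in for struct path: a (vfsmount, dentry) pair. */
struct toy_path {
	int mnt;
	int dentry;
};

/* Stand-in for struct union_mount: one link in a union stack. */
struct toy_union {
	struct toy_path u_this;	/* this layer's directory */
	struct toy_path u_next;	/* the directory it overlays */
};

static int toy_same_path(const struct toy_path *a, const struct toy_path *b)
{
	return a->mnt == b->mnt && a->dentry == b->dentry;
}

/*
 * Step one layer down. Returns 1 and updates *path if a lower layer
 * exists for this path, 0 if *path is already at the bottom.
 */
int toy_follow_down(const struct toy_union *tbl, size_t n,
		    struct toy_path *path)
{
	for (size_t i = 0; i < n; i++) {
		if (toy_same_path(&tbl[i].u_this, path)) {
			*path = tbl[i].u_next;
			return 1;
		}
	}
	return 0;
}

/*
 * Walk to the bottom of the stack and return the number of layers
 * descended. Assumes the table contains no cycles.
 */
size_t toy_follow_to_bottom(const struct toy_union *tbl, size_t n,
			    struct toy_path *path)
{
	size_t depth = 0;

	while (toy_follow_down(tbl, n, path))
		depth++;
	return depth;
}
```

With writable overlays the walk stops after at most one step, but the
loop shows why the same structure could describe deeper stacks.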
+
+Todo:
+
+- Rename union_mount structure - it's per directory, not per mount
+
+Code paths
+----------
+
+Writable overlays modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Path lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Writable overlays are created in two steps:
+
+1. Mount the bottom layer file system read-only in the usual manner.
+2. Mount the top layer with the "-o union" option at the same mountpoint.
+
+The bottom layer must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus (indicated by setting
+the MS_WHITEOUT flag). Currently, the top layer is forced to
+"noatime" to avoid a copyup on every access of a file. Supporting
+atime with the current infrastructure would require a copyup on every
+open().
+
+Currently, the top layer covers all submounts on the read-only file
+system. This can be inconvenient; e.g., mounting a writable overlay
+on the root file system after procfs has been mounted. It's not clear
+what the right behavior is. Also, it may be smarter to mount both
+read-only and read-write layers in one step, but the mount options get
+pretty ugly.
+
+pivot_root() is supported and is the recommended way to get to a root
+file system with a writable overlay.
+
+Todo:
+
+- Rename "-o union" mount option - "overlay"?
+- Don't permit mounting over read-write submounts
+- Choose submount covering behavior
+- Allow atime?
+
+Really really read-only file systems: In Linux, any individual file
+system may be mounted at multiple places in the namespace. The file
+system may change from read-only to read-write while still mounted.
+Thus, simply checking that the bottom layer is read-only at the time
+the writable overlay is mounted over it is pointless, since at any
+time the bottom layer may become read-write.
+
+We need to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a writable overlay. To do this, we track
+the number of "read-only users" of a file system in its VFS superblock
+structure. When we mount a writable overlay over a file system, we
+increment its read-only user count. The file system can only be
+mounted read-write if its read-only users count is zero.
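
The read-only user count can be sketched as a simple counter. The
struct, the function names, and the error codes below are assumptions
made for illustration; the real code keeps the count in the VFS
superblock and must also handle locking.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of a superblock. */
struct toy_super_block {
	bool read_only;		/* currently mounted read-only? */
	int readonly_users;	/* overlays relying on that state */
};

/* Mount a writable overlay over this file system. */
int toy_attach_overlay(struct toy_super_block *sb)
{
	if (!sb->read_only)
		return -EINVAL;	/* bottom layer must be read-only */
	sb->readonly_users++;
	return 0;
}

/* Unmount a writable overlay from this file system. */
void toy_detach_overlay(struct toy_super_block *sb)
{
	if (sb->readonly_users > 0)
		sb->readonly_users--;
}

/* Remount read-write; refused while any overlay depends on us. */
int toy_remount_rw(struct toy_super_block *sb)
{
	if (sb->readonly_users > 0)
		return -EBUSY;
	sb->read_only = false;
	return 0;
}
```

The point of the counter is that the read-only check happens at every
remount attempt, not just once when the overlay is mounted.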
+
+Todo:
+
+- Support really really read-only NFS mounts. See discussion here:
+
+ http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Path lookup
+-----------
+
+Much of the action in writable overlays happens during lookup().
+First, if we look up a directory on the bottom layer that doesn't yet
+exist on the top layer, __link_path_walk() always creates a matching
+directory on the top layer. This way, we never have to walk back up a
+path, creating directories as we go, before we can copyup a file.
+Second, if we need to copy up a file, we first (re)look it up with the
+LOOKUP_TOPMOST flag, which instructs __link_path_walk() to create it
+on the top layer. Neither directory entries nor file data are copied
+up in __link_path_walk() - that happens after the lookup, in the
+caller.
+
+The main cut-out to writable overlay code is in do_lookup():
+
+static int do_lookup(struct nameidata *nd, struct qstr *name,
+ struct path *path)
+{
+ int err;
+
+ if (IS_MNT_UNION(nd->path.mnt))
+ goto need_union_lookup;
+[...]
+need_union_lookup:
+ err = cache_lookup_union(nd, name, path);
+ if (!err && path->dentry)
+ goto done;
+
+ err = real_lookup_union(nd, name, path);
+ if (err)
+ goto fail;
+ goto done;
+
+cache_lookup_union() looks for the dentry in the dcache, starting at
+the top layer and following down. If it finds nothing, it returns a
+negative dentry from the top layer. If it finds a directory, it looks
+for the same directory in the bottom layer; if that exists, it
+allocates a union_mount struct and hangs the bottom layer dentry off
+of it. real_lookup_union() does the same for uncached entries.
+
+Todo:
+
+- Reorganize cache/hash/real lookup code - lots of code duplication
+- Turn create-on-topmost test into #ifdef'able function
+- Rewrite with assumption that topmost directory always exists
+- Remove duplicated tests and other duplicated code
+
+File copyup
+-----------
+
+Any system call that alters an existing file on the bottom layer
+(including creating or moving a hard link to it) will trigger a copyup
+of the target file to the top layer (via union_copyup() or
+__union_copyup()). This includes:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
+ - truncate()/ftruncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chattr()
+
+Copyup of a file DOES NOT occur on:
+
+ - open(O_RDONLY) if noatime
+ - stat() if noatime
+ - creat()/mkdir()/mknod()
+ - symlink()
+ - unlink()/rmdir()
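
For the open() case, the two lists above can be approximated by a
check on the open flags. This is a hedged userspace sketch, not the
patch's logic: open_needs_copyup() is a hypothetical name, noatime is
assumed, and only write access modes and O_TRUNC are covered.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Return true if opening a bottom-layer file with these flags would
 * require copying it up first. Assumes the overlay is mounted
 * noatime, so a plain read-only open never modifies the file.
 */
bool open_needs_copyup(int flags)
{
	int acc = flags & O_ACCMODE;

	/* Any open that allows writing needs a writable copy. */
	if (acc == O_WRONLY || acc == O_RDWR)
		return true;

	/* O_TRUNC alters the file even without a later write(). */
	if (flags & O_TRUNC)
		return true;

	return false;
}
```

The access mode has to be extracted with O_ACCMODE because O_RDONLY
is zero and cannot be tested as a bit.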
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version. Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible and unreferenced object
+that used to have the same pathname. Only opens that occur after the
+copyup will see updates to the file.
+
+Todo:
+
+- copyup on chown()/chmod()/chattr()
+- copyup if atime is enabled?
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO. At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()). In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode. This requires a full path, but
+often we only have the inode at this point. In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+Another trouble-maker is may_open(), which both checks permissions for
+open AND truncates the file if O_TRUNC is specified. It doesn't make
+any sense to copy up the file and then let may_open() truncate it, but
+we can't copy it after may_open() truncates it either. The current
+ugly hack is to pass the full nameidata to may_open() and copyup
+inside may_open().
+
+Some solutions:
+
+- Create __inode_permission() and pass it a flag telling it whether or
+ not to check for a read-only fs. Create union_permission() which
+ takes a path, checks for a union mount, and sets the rofs flag.
+ Place the file copyup call after all the permission checks are
+ completed. Push down the full path into the functions that need it
+ and currently only take the dentry or inode.
+
+- For each instance in which we might want to copyup, move permission
+ checks into a new function and call it from a level at which we
+ still have the full path. Pass it an "ignore read-only fs" flag if
+ the file is on a union mount. Pass around the ignore-rofs flag
+ inside the function doing permission checks. If all the permission
+ checks complete successfully, copyup the file. Would require moving
+ truncate out of may_open().
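
Both proposals rest on the same idea: the read-only check becomes
conditional on a flag that the caller sets when a writable overlay
covers the inode. A minimal sketch of that idea follows; the toy_*
names and the reduced inode model are hypothetical and stand in for
the __inode_permission()/union_permission() split proposed above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced inode: just what the check needs. */
struct toy_inode {
	bool fs_read_only;	/* inode lives on a read-only fs */
	bool may_write;		/* ordinary mode bits allow writing */
};

/* Permission check with an explicit "ignore read-only fs" flag. */
int toy_inode_permission(const struct toy_inode *inode, bool want_write,
			 bool ignore_rofs)
{
	if (!want_write)
		return 0;
	if (inode->fs_read_only && !ignore_rofs)
		return -EROFS;
	if (!inode->may_write)
		return -EACCES;
	return 0;
}

/*
 * Caller that still has the full path: it knows whether the path is
 * on a union mount and passes that down as the ignore-rofs flag.
 */
int toy_union_permission(const struct toy_inode *inode, bool on_union_mount,
			 bool want_write)
{
	return toy_inode_permission(inode, want_write, on_union_mount);
}
```

Only after this check returns 0 would the caller go on to copy the
file up, which keeps a failing system call from triggering any IO.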
+
+Todo:
+ - On truncate, only copy up the N bytes of file data requested
+ - Make sure above handles truncate beyond EOF correctly
+ - File copyup on chown()/chmod()/chattr() etc.
+ - File copyup on open(O_APPEND)
+ - File copyup on open(O_DIRECT)
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see include/linux/union.h). The union-specific code
+in the cache lookup path is out of line.
+
+Currently, is_unionized() is pretty heavy-weight: it walks up the
+mount hierarchy, grabbing the vfsmount lock at each level. It may be
+possible to simplify this greatly if a writable layer can only cover
+exactly one mount, rather than a tree of mounts.
+
+Todo:
+
+ - Turn copyup in __link_path_walk() into #ifdef'd function
+ - Do performance tests
+ - Optimize is_unionized()
+ - Properly #ifdef out mount path code
+
+Locking strategy
+================
+
+The current writable overlay locking strategy is based on the
+following rules:
+
+* Exactly two file systems are unioned
+* The bottom file system is always read-only
+* The top file system is always read-write
+ => A file system can never be a top and a bottom layer at the same time
+
+Additionally, the top layer (the writable overlay) may only be mounted
+exactly once. Don't think of the writable overlay as a separate
+independent file system; when it is mounted as a writable overlay, it
+is only a file system in conjunction with the read-only bottom layer.
+The read-only bottom layer is an independent file system in and of
+itself and can be mounted elsewhere, including as the bottom layer for
+another writable overlay.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer. Objects from the bottom layer are
+never changed (so don't need write locks) and only require atomic
+operations to manage kernel data structures (ref counts, etc.).
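
Because a file system has exactly one role, the role itself can serve
as the lock rank. The sketch below illustrates that ordering rule
only: toy_lock is a hypothetical stand-in for a real mutex (it just
records that it is held), and toy_lock_pair() is not a function from
the patch set.

```c
#include <stddef.h>

/* A file system's role is fixed, so it can double as a lock rank. */
enum toy_layer {
	TOY_TOP = 0,
	TOY_BOTTOM = 1
};

/* Stand-in for a real lock; only tracks whether it is held. */
struct toy_lock {
	enum toy_layer layer;
	int held;
};

/*
 * Take two locks in top-then-bottom order regardless of the order in
 * which the caller passed them. The acquisition order is recorded in
 * order[] so it can be inspected.
 */
void toy_lock_pair(struct toy_lock *a, struct toy_lock *b,
		   struct toy_lock *order[2])
{
	struct toy_lock *first = a;
	struct toy_lock *second = b;

	if (a->layer == TOY_BOTTOM && b->layer == TOY_TOP) {
		first = b;
		second = a;
	}

	first->held = 1;
	order[0] = first;
	second->held = 1;
	order[1] = second;
}
```

Two threads using this helper on the same pair of objects always
contend on the top-layer lock first, so neither can hold one lock
while waiting for the other in the opposite order.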
+
+Another simplifying assumption is that all directories in a pathname
+exist on the top layer, as they are created step-by-step during
+lookup. This prevents us from ever having to walk backwards up the
+path creating directory entries, which can get complicated especially
+when you consider the need to prevent topology changes. By
+implication, parent directories during any operation (rename(),
+unlink(), etc.) are from the top layer. Dentries for directories from
+the bottom layer are only ever used by lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A. Sometimes locks on
+objects in both A and B will have to be held simultaneously. What
+order should they be acquired in? Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires lock
+on object from A and then tries for a lock on object from B, while
+another thread grabs the lock on object from B and then waits for the
+lock on object from A. Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time. Changes in the directory topology and their effect on
+inheritance are of special concern. Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Holds i_mutex on top layer directory inode while doing lookups on
+bottom layer. Grabs i_mutex on bottom layer off and on.
+
+Todo:
+ - Is i_mutex on lower directory necessary?
+
+File copyup in general:
+
+File copyup occurs while holding i_mutex on the parent directory of
+the top layer. As noted before, an in-kernel file copyup is the
+logical equivalent of a userspace rename() of an identical file onto
+this pathname.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer. Followed by a normal link() operation.
+
+rename():
+
+First, renaming of directories returns EXDEV. It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway.
+
+Rename involves two operations on a writable overlay: (1) creation of
+a whiteout covering the source of the rename, (2) a copyup of the file
+from the bottom layer. The file copyup does not need to happen
+atomically, only the whiteout and the new link to the file.
+
+I propose that we copyup the source file to the "old" name (rather
+than directly to the "new" name), and then perform the normal file
+system rename operation. The only addition is creation of whiteout
+for the old name.
+
+The current rename() implementation is just a hack to get things
+working and doesn't work at all as described above.
+
+Lock order: The file copyup happens before the rename() lock. When we
+create the whiteout, we will already have the directory i_mutex.
+Otherwise, locking as usual.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir(). We hold the
+top layer directory i_mutex throughout. A fallthru is created for
+each entry that appears only on the lower layer.
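
Selecting which names need a fallthru amounts to a set difference
between the two directory listings. The following userspace sketch
assumes each layer's directory is given as an array of names and that
the top-layer array includes whiteouts and existing fallthrus; the
function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if name appears in list, 0 otherwise. */
static int toy_name_in(const char *name, const char *const *list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(name, list[i]) == 0)
			return 1;
	return 0;
}

/*
 * Collect the bottom-layer names that have no entry of any kind on
 * the top layer; these are the names that would get a fallthru.
 * Stores at most cap names in out[] and returns how many were stored.
 */
size_t toy_dir_copyup(const char *const *top, size_t ntop,
		      const char *const *bottom, size_t nbottom,
		      const char **out, size_t cap)
{
	size_t count = 0;

	for (size_t i = 0; i < nbottom && count < cap; i++)
		if (!toy_name_in(bottom[i], top, ntop))
			out[count++] = bottom[i];
	return count;
}
```

A name that is whited out on the top layer is present in the top
listing, so it is skipped and stays hidden after the copyup.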
+
+Current patch takes the i_mutex on the bottom layer directory, which
+doesn't seem to be necessary.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag.
+
+Whiteouts and fallthrus are most similar to symlinks, since they
+redirect to an object possibly located in another file system without
+keeping a reference on it.
+
+Todo:
+
+- Return correct inode number in d_ino member of struct dirent by one of:
+ - Save inode number of target in fallthru entry itself
+ - Lookup inode number during readdir()
+- Try re-implementing ext2 as special symlinks - may be much simpler
+- Implement ext3 (also as symlinks?)
+- Implement btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer. This patch set implements whiteouts for ext2, tmpfs, and
+jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer. To do this, the
+server needs to revoke access to clients requesting read-only file
+systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur). Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods. DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Implement whiteout()/fallthru() for NFS
+- Guarantee really really read-only on NFS exports
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel. A util-linux git
+tree with writable overlay support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An
+e2fsprogs git tree with writable overlay support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them. Userland will
+never see fallthru directory entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+- Link count may be wrong for files on bottom layer with > 1 link count
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+ rename(). Any existing open file descriptors will continue to refer
+ to the read-only copy on the bottom layer and will not see any
+ changes that occur after the copy-up.
+- rename() of directory fails with EXDEV
+
+Status
+======
+
+The current writable overlays patch set varies between RFC/prototype
+and pretty stable, depending on the particular patch. The current
+patch set boots to multi-user mode with a writable overlay root file
+system (albeit with some complaints). Some parts of the code were
+written years ago and have been reviewed, rewritten and tested many
+times. Other parts were written last month and need review,
+rewriting, and testing. The commit messages note the state of each
+patch.
+
+The current patch set is against 2.6.31. You can find it here, in the
+branch "overlay":
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Non-features
+------------
+
+Features we do not currently plan to support as part of writable
+overlays:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer to change while the writable
+overlay file system is mounted invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories. Doing an in-kernel copy of a single file is
+bad enough. Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs). Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+writable overlays with the writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Aggregation of multiple read-only file systems: While perfectly
+reasonable from a user perspective, we just aren't smart enough to
+figure out the locking problems from a kernel perspective. Sorry!
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to writable overlays
+=================================
+
+The writable overlays web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The mailing list for discussing writable overlays is:
+
+linux-fsdevel@vger.kernel.org
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
--
1.6.3.3

2010-04-15 23:07:14

by Valerie Aurora

Subject: [PATCH 17/35] union-mount: Introduce MNT_UNION and MS_UNION flags

From: Jan Blunck <[email protected]>

Add per mountpoint flag for Union Mount support. You need additional patches
to util-linux for that to work - see:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
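
As a rough userspace illustration of what the flag handling in this
patch does: the mount(2) flag is translated into a per-mount flag and
then stripped before the remaining flags are passed on. The values of
TOY_MS_UNION and TOY_MNT_UNION match the defines in this patch;
TOY_MNT_READONLY and the function name are assumptions made for the
example.

```c
#define TOY_MS_RDONLY		1	/* mount(2) read-only flag */
#define TOY_MS_UNION		256	/* same value as MS_UNION here */
#define TOY_MNT_READONLY	0x40	/* illustrative per-mount value */
#define TOY_MNT_UNION		0x4000	/* same value as MNT_UNION here */

/*
 * Translate mount(2) flags into per-mount flags. Per-mount-only
 * flags are cleared from *flags so the file system never sees them;
 * MS_RDONLY is left in place because the superblock needs it too.
 */
unsigned long toy_translate_flags(unsigned long *flags)
{
	unsigned long mnt_flags = 0;

	if (*flags & TOY_MS_RDONLY)
		mnt_flags |= TOY_MNT_READONLY;
	if (*flags & TOY_MS_UNION)
		mnt_flags |= TOY_MNT_UNION;

	*flags &= ~(unsigned long)TOY_MS_UNION;
	return mnt_flags;
}
```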

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 5 ++++-
include/linux/fs.h | 1 +
include/linux/mount.h | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9a40282..5e4b27b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -808,6 +808,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_STRICTATIME, ",strictatime" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
@@ -2018,10 +2019,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_UNION);

if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a5ba718..4dae882 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,7 @@ struct inodes_stat_t {
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Merge namespace with FS mounted below */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 4bd0547..f6b714c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -43,9 +43,9 @@ struct mnt_namespace;
*/
#define MNT_SHARED_MASK (MNT_UNBINDABLE)
#define MNT_PROPAGATION_MASK (MNT_SHARED | MNT_UNBINDABLE)
+#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */

-
-#define MNT_INTERNAL 0x4000
+#define MNT_INTERNAL 0x8000

struct vfsmount {
struct list_head mnt_hash;
--
1.6.3.3

2010-04-15 23:07:25

by Valerie Aurora

Subject: [PATCH 18/35] union-mount: Introduce union_mount structure and basic operations

From: Jan Blunck <[email protected]>

This patch adds the basic structures and operations of VFS-based union
mounts (but not the ability to mount or lookup unioned file systems).
Each directory in a unioned file system has an associated union stack
created when the directory is first looked up. The union stack is a
structure kept in a hash table indexed by mount and dentry of the
directory; thus, specific paths are unioned, not dentries alone. The
union stack keeps a pointer to the upper path and the lower path and
can be looked up by either path.
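
The "looked up by either path" property can be sketched with two
small chained hash tables over the same objects. This is a toy
userspace model under stated assumptions: paths are integer pairs,
the tables have a fixed size, the hash is arbitrary, and there is no
locking or reference counting. The toy_* names are hypothetical and
do not appear in fs/union.c.

```c
#include <stddef.h>

#define TOY_BUCKETS 8

/* Stand-in for struct path: a (vfsmount, dentry) pair. */
struct toy_path {
	int mnt;
	int dentry;
};

struct toy_union {
	struct toy_path upper;
	struct toy_path lower;
	struct toy_union *next_hash;	/* chain keyed by upper path */
	struct toy_union *next_rhash;	/* chain keyed by lower path */
};

static struct toy_union *toy_hashtable[TOY_BUCKETS];
static struct toy_union *toy_rhashtable[TOY_BUCKETS];

static unsigned int toy_hash(const struct toy_path *p)
{
	return ((unsigned int)p->mnt * 31u + (unsigned int)p->dentry) %
		TOY_BUCKETS;
}

static int toy_same_path(const struct toy_path *a, const struct toy_path *b)
{
	return a->mnt == b->mnt && a->dentry == b->dentry;
}

/* Insert one union into both tables. */
void toy_union_hash(struct toy_union *um)
{
	unsigned int h = toy_hash(&um->upper);
	unsigned int r = toy_hash(&um->lower);

	um->next_hash = toy_hashtable[h];
	toy_hashtable[h] = um;
	um->next_rhash = toy_rhashtable[r];
	toy_rhashtable[r] = um;
}

/* Find the union whose upper path matches. */
struct toy_union *toy_union_lookup(const struct toy_path *upper)
{
	struct toy_union *um;

	for (um = toy_hashtable[toy_hash(upper)]; um; um = um->next_hash)
		if (toy_same_path(&um->upper, upper))
			return um;
	return NULL;
}

/* Find the union whose lower path matches (reverse lookup). */
struct toy_union *toy_union_rlookup(const struct toy_path *lower)
{
	struct toy_union *um;

	for (um = toy_rhashtable[toy_hash(lower)]; um; um = um->next_rhash)
		if (toy_same_path(&um->lower, lower))
			return um;
	return NULL;
}
```

Keying on the (mount, dentry) pair rather than the dentry alone is
what lets the same dentry take part in different unions when its file
system is mounted in more than one place.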

This particular version of union mounts is based on ideas by Jan
Blunck, Bharata Rao, and many others.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/Kconfig | 13 ++
fs/Makefile | 1 +
fs/dcache.c | 4 +
fs/union.c | 289 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 20 ++++
include/linux/mount.h | 3 +
include/linux/union.h | 53 +++++++++
7 files changed, 383 insertions(+), 0 deletions(-)
create mode 100644 fs/union.c
create mode 100644 include/linux/union.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7405f07..c16b9db 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,19 @@ source "fs/notify/Kconfig"

source "fs/quota/Kconfig"

+config UNION_MOUNT
+ bool "Writable overlays (union mounts) (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Writable overlays allow you to mount a transparent writable
+ layer over a read-only file system, for example, an ext3
+ partition on a hard drive over a CD-ROM root file system
+ image.
+
+ See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+ If unsure, say N.
+
source "fs/autofs/Kconfig"
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index c3633aa..9693730 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_NFS_COMMON) += nfs_common/
obj-$(CONFIG_GENERIC_ACL) += generic_acl.o

obj-y += quota/
+obj-$(CONFIG_UNION_MOUNT) += union.o

obj-$(CONFIG_PROC_FS) += proc/
obj-y += partitions/
diff --git a/fs/dcache.c b/fs/dcache.c
index 1575af4..05c3a1e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -960,6 +960,10 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
+#ifdef CONFIG_UNION_MOUNT
+ INIT_LIST_HEAD(&dentry->d_unions);
+ dentry->d_unionized = 0;
+#endif

if (parent) {
dentry->d_parent = dget(parent);
diff --git a/fs/union.c b/fs/union.c
new file mode 100644
index 0000000..8e74fa4
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,289 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/hash.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/fs_struct.h>
+#include <linux/slab.h>
+#include <linux/union.h>
+
+/*
+ * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
+ * should try to make this good - I've just made it work.
+ */
+static unsigned int union_hash_mask __read_mostly;
+static unsigned int union_hash_shift __read_mostly;
+static struct hlist_head *union_hashtable __read_mostly;
+static unsigned int union_rhash_mask __read_mostly;
+static unsigned int union_rhash_shift __read_mostly;
+static struct hlist_head *union_rhashtable __read_mostly;
+
+/*
+ * Locking Rules:
+ * - dcache_lock (for union_rlookup() only)
+ * - union_lock
+ */
+DEFINE_SPINLOCK(union_lock);
+
+static struct kmem_cache *union_cache __read_mostly;
+
+static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
+{
+ unsigned long tmp;
+
+ tmp = ((unsigned long)mnt * (unsigned long)dentry) ^
+ (GOLDEN_RATIO_PRIME + (unsigned long)mnt) / L1_CACHE_BYTES;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> union_hash_shift);
+ return tmp & union_hash_mask;
+}
+
+static __initdata unsigned long union_hash_entries;
+
+static int __init set_union_hash_entries(char *str)
+{
+ if (!str)
+ return 0;
+ union_hash_entries = simple_strtoul(str, &str, 0);
+ return 1;
+}
+
+__setup("union_hash_entries=", set_union_hash_entries);
+
+static int __init init_union(void)
+{
+ int loop;
+
+ union_cache = KMEM_CACHE(union_mount, SLAB_PANIC | SLAB_MEM_SPREAD);
+ union_hashtable = alloc_large_system_hash("Union-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_hash_shift,
+ &union_hash_mask,
+ 0);
+
+ for (loop = 0; loop < (1 << union_hash_shift); loop++)
+ INIT_HLIST_HEAD(&union_hashtable[loop]);
+
+
+ union_rhashtable = alloc_large_system_hash("rUnion-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_rhash_shift,
+ &union_rhash_mask,
+ 0);
+
+ for (loop = 0; loop < (1 << union_rhash_shift); loop++)
+ INIT_HLIST_HEAD(&union_rhashtable[loop]);
+
+ return 0;
+}
+
+fs_initcall(init_union);
+
+static struct union_mount *union_alloc(struct path *upper, struct path *lower)
+{
+ struct union_mount *um;
+
+ BUG_ON(!S_ISDIR(upper->dentry->d_inode->i_mode));
+ BUG_ON(!S_ISDIR(lower->dentry->d_inode->i_mode));
+
+ um = kmem_cache_alloc(union_cache, GFP_ATOMIC);
+ if (!um)
+ return NULL;
+
+ atomic_set(&um->u_count, 1);
+ INIT_LIST_HEAD(&um->u_unions);
+ INIT_HLIST_NODE(&um->u_hash);
+ INIT_HLIST_NODE(&um->u_rhash);
+
+ um->u_upper.mnt = upper->mnt;
+ um->u_upper.dentry = upper->dentry;
+ um->u_lower.mnt = mntget(lower->mnt);
+ um->u_lower.dentry = dget(lower->dentry);
+
+ return um;
+}
+
+struct union_mount *union_get(struct union_mount *um)
+{
+ BUG_ON(!atomic_read(&um->u_count));
+ atomic_inc(&um->u_count);
+ return um;
+}
+
+static int __union_put(struct union_mount *um)
+{
+ if (!atomic_dec_and_test(&um->u_count))
+ return 0;
+
+ BUG_ON(!hlist_unhashed(&um->u_hash));
+ BUG_ON(!hlist_unhashed(&um->u_rhash));
+
+ kmem_cache_free(union_cache, um);
+ return 1;
+}
+
+void union_put(struct union_mount *um)
+{
+ struct path tmp = um->u_lower;
+
+ if (__union_put(um))
+ path_put(&tmp);
+}
+
+static void __union_hash(struct union_mount *um)
+{
+ hlist_add_head(&um->u_hash, union_hashtable +
+ hash(um->u_upper.dentry, um->u_upper.mnt));
+ hlist_add_head(&um->u_rhash, union_rhashtable +
+ hash(um->u_lower.dentry, um->u_lower.mnt));
+}
+
+static void __union_unhash(struct union_mount *um)
+{
+ hlist_del_init(&um->u_hash);
+ hlist_del_init(&um->u_rhash);
+}
+
+static struct union_mount *union_cache_lookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct hlist_head *head = union_hashtable + hash(dentry, mnt);
+ struct hlist_node *node;
+ struct union_mount *um;
+
+ hlist_for_each_entry(um, node, head, u_hash) {
+ if ((um->u_upper.dentry == dentry) &&
+ (um->u_upper.mnt == mnt))
+ return um;
+ }
+
+ return NULL;
+}
+
+static struct union_mount *union_cache_rlookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct hlist_head *head = union_rhashtable + hash(dentry, mnt);
+ struct hlist_node *node;
+ struct union_mount *um;
+
+ hlist_for_each_entry(um, node, head, u_rhash) {
+ if ((um->u_lower.dentry == dentry) &&
+ (um->u_lower.mnt == mnt))
+ return um;
+ }
+
+ return NULL;
+}
+
+/*
+ * append_to_union - add a path to the bottom of the union stack
+ *
+ * Allocate and attach a union cache entry linking the new, upper
+ * mnt/dentry to the "covered" matching lower mnt/dentry. It's okay
+ * if the union cache entry already exists.
+ */
+
+int append_to_union(struct path *upper, struct path *lower)
+{
+ struct union_mount *new, *um;
+
+ BUG_ON(!S_ISDIR(upper->dentry->d_inode->i_mode));
+ BUG_ON(!S_ISDIR(lower->dentry->d_inode->i_mode));
+
+ /* Common case is that it's already been created, do a lookup first */
+
+ spin_lock(&union_lock);
+ um = union_cache_lookup(upper->dentry, upper->mnt);
+ if (um) {
+ BUG_ON((um->u_lower.dentry != lower->dentry) ||
+ (um->u_lower.mnt != lower->mnt));
+ spin_unlock(&union_lock);
+ return 0;
+ }
+ spin_unlock(&union_lock);
+
+ new = union_alloc(upper, lower);
+ if (!new)
+ return -ENOMEM;
+
+ spin_lock(&union_lock);
+ um = union_cache_lookup(upper->dentry, upper->mnt);
+ if (um) {
+ /* Someone added it while we were allocating, no problem */
+ BUG_ON((um->u_lower.dentry != lower->dentry) ||
+ (um->u_lower.mnt != lower->mnt));
+ spin_unlock(&union_lock);
+ union_put(new);
+ return 0;
+ }
+ __union_hash(new);
+ spin_unlock(&union_lock);
+ return 0;
+}
+
+/*
+ * WARNING! Confusing terminology alert.
+ *
+ * Note that the directions "up" and "down" in union mounts are the
+ * opposite of "up" and "down" in normal VFS operation terminology.
+ * "up" in the rest of the VFS means "towards the root of the mount
+ * tree." If you mount B on top of A, following B "up" will get you
+ * A. In union mounts, "up" means "towards the most recently mounted
+ * layer of the union stack." If you union mount B on top of A,
+ * following A "up" will get you to B. Another way to put it is that
+ * "up" in the VFS means going from this mount towards the direction
+ * of its mnt->mnt_parent pointer, but "up" in union mounts means
+ * going in the opposite direction (until you run out of union
+ * layers).
+ */
+
+/*
+ * union_down_one - get the next lower directory in the union stack
+ *
+ * This is called to traverse the union stack from the given layer to
+ * the next lower layer. union_down_one() is called by various
+ * lookup functions that are aware of union mounts.
+ *
+ * Returns non-zero if followed to the next lower layer, zero otherwise.
+ *
+ * See note on up/down terminology above.
+ */
+int union_down_one(struct vfsmount **mnt, struct dentry **dentry)
+{
+ struct union_mount *um;
+
+ if (!IS_MNT_UNION(*mnt))
+ return 0;
+
+ spin_lock(&union_lock);
+ um = union_cache_lookup(*dentry, *mnt);
+ spin_unlock(&union_lock);
+ if (um) {
+ path_get(&um->u_lower);
+ dput(*dentry);
+ *dentry = um->u_lower.dentry;
+ mntput(*mnt);
+ *mnt = um->u_lower.mnt;
+ return 1;
+ }
+ return 0;
+}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e035c51..d6c1da2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -101,6 +101,26 @@ struct dentry {
struct dentry *d_parent; /* parent directory */
struct qstr d_name;

+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * Stacks of union mount structures are connected to dentries
+ * through the d_unions field. If this list is not empty,
+ * then this dentry is part of a unioned directory stack.
+ * Protected by union_lock.
+ */
+ struct list_head d_unions; /* list of union_mount's */
+ /*
+ * If d_unionized is set, then this dentry is referenced by
+ * the u_next field of a union mount structure - that is, it
+ * is a dentry for a lower layer of a union. d_unionized is
+ * NOT set in the dentry for the topmost layer of a union.
+ *
+ * d_unionized would be better renamed to d_union_lower or
+ * d_union_ref.
+ */
+ unsigned int d_unionized; /* unions referencing this dentry */
+#endif
+
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f6b714c..0517114 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -64,6 +64,9 @@ struct vfsmount {
struct list_head mnt_slave_list;/* list of slave mounts */
struct list_head mnt_slave; /* slave list entry */
struct vfsmount *mnt_master; /* slave is on master->mnt_slave_list */
+#ifdef CONFIG_UNION_MOUNT
+ struct list_head mnt_unions; /* list of union_mount structures */
+#endif
struct mnt_namespace *mnt_ns; /* containing namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
diff --git a/include/linux/union.h b/include/linux/union.h
new file mode 100644
index 0000000..bc83a2f
--- /dev/null
+++ b/include/linux/union.h
@@ -0,0 +1,53 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/list.h>
+#include <asm/atomic.h>
+
+struct dentry;
+struct vfsmount;
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * The union mount structure.
+ */
+struct union_mount {
+ atomic_t u_count; /* reference count */
+ struct list_head u_unions; /* list head for d_unions */
+ struct list_head u_list; /* list head for mnt_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */
+
+ struct path u_upper; /* this is me */
+ struct path u_lower; /* this is what I overlay */
+};
+
+#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+
+extern int append_to_union(struct path *, struct path*);
+extern int union_down_one(struct vfsmount **, struct dentry **);
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define IS_MNT_UNION(x) (0)
+#define append_to_union(x, y) ({ BUG(); (0); })
+#define union_down_one(x, y) ({ (0); })
+
+#endif /* CONFIG_UNION_MOUNT */
+#endif /* __KERNEL__ */
+#endif /* __LINUX_UNION_H */
--
1.6.3.3

2010-04-15 23:07:35

by Valerie Aurora

Subject: [PATCH 19/35] union-mount: Drive the union cache via dcache

From: Jan Blunck <[email protected]>

If a dentry is removed from the dentry cache because its usage count drops
to zero, the references it holds on the underlying layers of its unions are
dropped too. The union cache is therefore driven by the dentry cache.
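
The lifetime rule described above can be sketched in a small userspace model. This is only an illustration under stated assumptions: struct toy_dentry, toy_dget(), toy_dput() and toy_union() are hypothetical names, not kernel interfaces, and the model ignores locking and the hash tables entirely. It shows that the reference pinned on the lower layer goes away when the upper object's count reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry that may sit on top of a lower layer. */
struct toy_dentry {
	int refcount;			/* plays the role of d_count */
	struct toy_dentry *lower;	/* lower layer pinned by the union, or NULL */
};

static void toy_dget(struct toy_dentry *d)
{
	d->refcount++;
}

/* Link upper to lower; the union entry holds one reference on lower. */
static void toy_union(struct toy_dentry *upper, struct toy_dentry *lower)
{
	assert(upper->lower == NULL);
	toy_dget(lower);
	upper->lower = lower;
}

/*
 * Drop one reference.  When the count reaches zero, the reference held
 * on the lower layer is dropped as well.  The loop walks down the stack
 * instead of recursing.
 */
static void toy_dput(struct toy_dentry *d)
{
	while (d) {
		struct toy_dentry *lower;

		assert(d->refcount > 0);
		if (--d->refcount > 0)
			return;
		lower = d->lower;
		d->lower = NULL;
		d = lower;
	}
}
```

In this model, dropping the last reference on the upper object also releases the one reference that toy_union() took on the lower object, which is the sense in which the dentry cache drives the union cache.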

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 13 +++++++++++
fs/union.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 8 ++++++
include/linux/union.h | 4 +++
4 files changed, 81 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 05c3a1e..983a1ea 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -18,6 +18,7 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/union.h>
#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -175,6 +176,8 @@ static struct dentry *d_kill(struct dentry *dentry)
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
+ /* If the dentry was in a union, delete its unions */
+ shrink_d_unions(dentry);
if (IS_ROOT(dentry))
parent = NULL;
else
@@ -696,6 +699,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
iput(inode);
}

+ shrink_d_unions(dentry);
d_free(dentry);

/* finished when we fall off the top of the tree,
@@ -1535,7 +1539,9 @@ void d_delete(struct dentry * dentry)
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (atomic_read(&dentry->d_count) == 1) {
+ __d_drop_unions(dentry);
dentry_iput(dentry);
+ shrink_d_unions(dentry);
fsnotify_nameremove(dentry, isdir);
return;
}
@@ -1546,6 +1552,13 @@ void d_delete(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

+ /*
+ * Remove any associated unions. While someone still has this
+ * directory open (ref count > 0), we could not have deleted
+ * it unless it was empty, and it therefore has no references to
+ * directories below it. So we don't need the unions.
+ */
+ shrink_d_unions(dentry);
fsnotify_nameremove(dentry, isdir);
}
EXPORT_SYMBOL(d_delete);
diff --git a/fs/union.c b/fs/union.c
index 8e74fa4..4168b62 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -14,6 +14,7 @@

#include <linux/bootmem.h>
#include <linux/init.h>
+#include <linux/module.h>
#include <linux/types.h>
#include <linux/hash.h>
#include <linux/fs.h>
@@ -235,6 +236,8 @@ int append_to_union(struct path *upper, struct path *lower)
union_put(new);
return 0;
}
+ list_add(&new->u_unions, &upper->dentry->d_unions);
+ lower->dentry->d_unionized++;
__union_hash(new);
spin_unlock(&union_lock);
return 0;
@@ -287,3 +290,56 @@ int union_down_one(struct vfsmount **mnt, struct dentry **dentry)
}
return 0;
}
+
+/**
+ * __d_drop_unions - remove all this dentry's unions from the union hash table
+ *
+ * @dentry - topmost dentry in the union stack to remove
+ *
+ * This must be called after unhashing a dentry. This is called with
+ * dcache_lock held and unhashes all the unions this dentry is
+ * attached to.
+ */
+void __d_drop_unions(struct dentry *dentry)
+{
+ struct union_mount *this, *next;
+
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions)
+ __union_unhash(this);
+ spin_unlock(&union_lock);
+}
+EXPORT_SYMBOL_GPL(__d_drop_unions);
+
+/*
+ * This must be called after __d_drop_unions() without holding any locks.
+ * Note: The dentry might still be reachable via a lookup but at that time it
+ * is already a negative dentry. Otherwise it would be unhashed. The union_mount
+ * structure itself is still reachable through mnt->mnt_unions (which we
+ * protect against with union_lock).
+ *
+ * We were worried about a recursive dput() call through:
+ *
+ * dput()->d_kill()->shrink_d_unions()->union_put()->dput()
+ *
+ * But this path can only be reached if the dentry is unhashed when we
+ * enter the first dput(), and it can only be unhashed if it was
+ * rmdir()'d, and d_delete() calls shrink_d_unions() for us.
+ */
+void shrink_d_unions(struct dentry *dentry)
+{
+ struct union_mount *this, *next;
+
+repeat:
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+ BUG_ON(!hlist_unhashed(&this->u_hash));
+ BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_unions);
+ this->u_lower.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(this);
+ goto repeat;
+ }
+ spin_unlock(&union_lock);
+}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d6c1da2..bfa8f97 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -227,12 +227,20 @@ extern seqlock_t rename_lock;
* __d_drop requires dentry->d_lock.
*/

+#ifdef CONFIG_UNION_MOUNT
+extern void __d_drop_unions(struct dentry *);
+#endif
+
static inline void __d_drop(struct dentry *dentry)
{
if (!(dentry->d_flags & DCACHE_UNHASHED)) {
dentry->d_flags |= DCACHE_UNHASHED;
hlist_del_rcu(&dentry->d_hash);
}
+#ifdef CONFIG_UNION_MOUNT
+ /* remove dentry from the union hashtable */
+ __d_drop_unions(dentry);
+#endif
}

static inline void d_drop(struct dentry *dentry)
diff --git a/include/linux/union.h b/include/linux/union.h
index bc83a2f..c5ead54 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -41,12 +41,16 @@ struct union_mount {

extern int append_to_union(struct path *, struct path*);
extern int union_down_one(struct vfsmount **, struct dentry **);
+extern void __d_drop_unions(struct dentry *);
+extern void shrink_d_unions(struct dentry *);

#else /* CONFIG_UNION_MOUNT */

#define IS_MNT_UNION(x) (0)
#define append_to_union(x, y) ({ BUG(); (0); })
#define union_down_one(x, y) ({ (0); })
+#define __d_drop_unions(x) do { } while (0)
+#define shrink_d_unions(x) do { } while (0)

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-04-15 23:07:51

by Valerie Aurora

Subject: [PATCH 22/35] union-mount: Call do_whiteout() on unlink and rmdir in unions

From: Jan Blunck <[email protected]>

Call do_whiteout() when removing files and directories from a union
mounted file system.
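
As a rough illustration of what a whiteout achieves, here is a toy two-layer model in userspace C. All names (struct toy_layer, toy_union_lookup(), toy_union_unlink()) are hypothetical and are not the kernel's do_whiteout() interface. The sketch also simplifies by always recording a whiteout in the upper layer, whether or not the lower layer has an entry of that name.

```c
#include <assert.h>
#include <string.h>

enum toy_type { TOY_NONE, TOY_FILE, TOY_WHITEOUT };

#define TOY_MAX 8

/* Hypothetical directory layer: a small table of names and entry types. */
struct toy_layer {
	const char *name[TOY_MAX];
	enum toy_type type[TOY_MAX];
	int n;
};

static int toy_find(const struct toy_layer *l, const char *name)
{
	int i;

	for (i = 0; i < l->n; i++)
		if (strcmp(l->name[i], name) == 0)
			return i;
	return -1;
}

/* The upper layer wins; a whiteout in the upper layer hides the lower entry. */
static enum toy_type toy_union_lookup(const struct toy_layer *upper,
				      const struct toy_layer *lower,
				      const char *name)
{
	int i = toy_find(upper, name);

	if (i >= 0)
		return upper->type[i] == TOY_WHITEOUT ? TOY_NONE : upper->type[i];
	i = toy_find(lower, name);
	return i >= 0 ? lower->type[i] : TOY_NONE;
}

/* Unlink in the union: mark a whiteout in the upper layer, never touch lower. */
static int toy_union_unlink(struct toy_layer *upper,
			    const struct toy_layer *lower, const char *name)
{
	int i;

	if (toy_union_lookup(upper, lower, name) == TOY_NONE)
		return -1;		/* nothing visible to remove */
	i = toy_find(upper, name);
	if (i < 0) {
		if (upper->n == TOY_MAX)
			return -1;	/* toy table is full */
		i = upper->n++;
		upper->name[i] = name;
	}
	upper->type[i] = TOY_WHITEOUT;
	return 0;
}
```

After toy_union_unlink() succeeds, the name no longer resolves through the union even though the read-only lower layer still contains it.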

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b179062..900df0f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2670,6 +2670,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit3;
+ if (IS_UNIONED_DIR(&nd.path)) {
+ error = do_whiteout(&nd, &path, 1);
+ goto exit4;
+ }
error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
@@ -2759,6 +2763,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
+ if (IS_UNIONED_DIR(&nd.path)) {
+ error = do_whiteout(&nd, &path, 0);
+ goto exit3;
+ }
error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
--
1.6.3.3

2010-04-15 23:07:56

by Valerie Aurora

Subject: [PATCH 27/35] union-mount: Implement union-aware access()/faccessat()

For union mounts, an access check on a file located on the lower layer
incorrectly returns EROFS. To fix this, use the new path_permission()
call, which ignores a read-only lower layer file system if the target
will be copied up to the topmost file system.
---
fs/open.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e17f544..686fcd2 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -454,7 +454,10 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
const struct cred *old_cred;
struct cred *override_cred;
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int res;

if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
@@ -478,10 +481,17 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)

old_cred = override_creds(override_cred);

- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ res = user_path_nd(dfd, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (res)
goto out;

+ /* For union mounts, use the topmost mnt's permissions */
+ if (IS_UNIONED_DIR(&nd.path))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
inode = path.dentry->d_inode;

if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
@@ -490,11 +500,11 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (mnt->mnt_flags & MNT_NOEXEC)
goto out_path_release;
}

- res = inode_permission(inode, mode | MAY_ACCESS);
+ res = path_permission(&path, &nd.path, mode | MAY_ACCESS);
/* SuS v2 requires we report a read only fs too */
if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
goto out_path_release;
@@ -508,11 +518,13 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if (__mnt_is_readonly(mnt))
res = -EROFS;

out_path_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
revert_creds(old_cred);
put_cred(override_cred);
--
1.6.3.3

2010-04-15 23:08:07

by Valerie Aurora

Subject: [PATCH 30/35] union-mount: Implement union-aware writable open()

Copy up a file when it is opened for writing. The file data is not
copied up when O_TRUNC is specified.
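
The O_TRUNC shortcut can be illustrated with a small userspace model. The names below (struct toy_file, toy_copyup_len(), toy_open_copyup()) are hypothetical and only mirror the shape of open_union_copyup() in the patch: copying zero bytes when the caller is about to truncate anyway.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory file: a fixed buffer and a length. */
struct toy_file {
	char data[32];
	size_t len;
};

/* Copy the first len bytes of the lower file into the upper file. */
static void toy_copyup_len(struct toy_file *upper, const struct toy_file *lower,
			   size_t len)
{
	assert(len <= lower->len);
	memcpy(upper->data, lower->data, len);
	upper->len = len;
}

/* On a writable open, skip copying data if the file will be truncated. */
static void toy_open_copyup(struct toy_file *upper, const struct toy_file *lower,
			    int open_flag)
{
	toy_copyup_len(upper, lower, (open_flag & O_TRUNC) ? 0 : lower->len);
}
```

The upper file is created in both cases; only the amount of data carried over differs.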
---
fs/namei.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a6f7d5d..85a5451 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1929,6 +1929,24 @@ exit:
return ERR_PTR(error);
}

+static int open_union_copyup(struct nameidata *nd, struct path *path,
+ int open_flag)
+{
+ struct vfsmount *oldmnt = path->mnt;
+ int error;
+
+ if (open_flag & O_TRUNC)
+ error = union_copyup_len(nd, path, 0);
+ else
+ error = union_copyup(nd, path);
+ if (error)
+ return error;
+ if (oldmnt != path->mnt)
+ mntput(nd->path.mnt);
+
+ return error;
+}
+
static struct file *do_last(struct nameidata *nd, struct path *path,
int open_flag, int acc_mode,
int mode, const char *pathname,
@@ -1979,6 +1997,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
error = -ENOTDIR;
if (*want_dir && !path->dentry->d_inode->i_op->lookup)
goto exit_dput;
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
audit_inode(pathname, nd->path.dentry);
goto ok;
@@ -2050,6 +2073,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (path->dentry->d_inode->i_op->follow_link)
return NULL;

+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
error = -EISDIR;
if (S_ISDIR(path->dentry->d_inode->i_mode))
--
1.6.3.3

2010-04-15 23:07:44

by Valerie Aurora

Subject: [PATCH 21/35] union-mount: Support for mounting union mount file systems

Create and tear down union mount structures on mount. Check
requirements for union mounts.
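
The requirements checked at mount time can be summarised as a toy predicate. This is a userspace sketch with hypothetical names (struct toy_mount_req, toy_check_union()); the order of the checks and the error codes follow check_union_mnt() in the diff below, but none of the kernel data structures are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the properties the mount-time check inspects. */
struct toy_mount_req {
	bool lower_readonly;		/* lower superblock is read-only */
	bool lower_has_submounts;	/* lower mount has child mounts */
	bool mountpoint_is_root;	/* union is made at the lower root dir */
	bool top_readonly;		/* topmost mount requested read-only */
	bool top_supports_whiteouts;	/* topmost fs supports whiteouts */
};

static int toy_check_union(const struct toy_mount_req *r)
{
	if (!r->lower_readonly)
		return -EBUSY;	/* lower namespace must not change */
	if (r->lower_has_submounts)
		return -EBUSY;	/* submounts on the lower layer unsupported */
	if (!r->mountpoint_is_root)
		return -EINVAL;	/* only union at root directories */
	if (r->top_readonly)
		return -EROFS;	/* copy-up needs a writable top layer */
	if (!r->top_supports_whiteouts)
		return -EINVAL;	/* whiteouts and fallthrus are required */
	return 0;
}
```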

Thanks to Felix Fietkau <[email protected]> for a bug fix.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/union.c | 63 ++++++++++++++++++++++++
include/linux/union.h | 4 ++
3 files changed, 196 insertions(+), 1 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5e4b27b..e19a432 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -29,6 +29,7 @@
#include <linux/log2.h>
#include <linux/idr.h>
#include <linux/fs_struct.h>
+#include <linux/union.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include "pnode.h"
@@ -157,6 +158,9 @@ struct vfsmount *alloc_vfsmnt(const char *name)
#else
mnt->mnt_writers = 0;
#endif
+#ifdef CONFIG_UNION_MOUNT
+ INIT_LIST_HEAD(&mnt->mnt_unions);
+#endif
}
return mnt;

@@ -492,6 +496,7 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)

static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
{
+ detach_mnt_union(mnt);
old_path->dentry = mnt->mnt_mountpoint;
old_path->mnt = mnt->mnt_parent;
mnt->mnt_parent = mnt;
@@ -515,6 +520,7 @@ static void attach_mnt(struct vfsmount *mnt, struct path *path)
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(path->mnt, path->dentry));
list_add_tail(&mnt->mnt_child, &path->mnt->mnt_mounts);
+ attach_mnt_union(mnt, path->mnt);
}

/*
@@ -537,6 +543,7 @@ static void commit_tree(struct vfsmount *mnt)
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(parent, mnt->mnt_mountpoint));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+ attach_mnt_union(mnt, parent);
touch_mnt_namespace(n);
}

@@ -1025,6 +1032,7 @@ void release_mounts(struct list_head *head)
struct dentry *dentry;
struct vfsmount *m;
spin_lock(&vfsmount_lock);
+ detach_mnt_union(mnt);
dentry = mnt->mnt_mountpoint;
m = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
@@ -1139,6 +1147,12 @@ static int do_umount(struct vfsmount *mnt, int flags)
if (!list_empty(&mnt->mnt_list))
umount_tree(mnt, 1, &umount_list);
retval = 0;
+ /*
+ * If this was a union mount, we are no longer a
+ * read-only user on the underlying mount.
+ */
+ if (mnt->mnt_flags & MNT_UNION)
+ dec_hard_readonly_users(mnt->mnt_parent);
}
spin_unlock(&vfsmount_lock);
if (retval)
@@ -1490,6 +1504,17 @@ static int do_change_type(struct path *path, int flag)
return -EINVAL;

down_write(&namespace_sem);
+
+ /*
+ * Mounts of file systems with read-only users can't deal with
+ * mount/umount propagation events - it's the moral equivalent
+ * of rm -rf dir/ or the like.
+ */
+ if (sb_is_hard_readonly(mnt->mnt_sb)) {
+ err = -EROFS;
+ goto out_unlock;
+ }
+
if (type == MS_SHARED) {
err = invent_group_ids(mnt, recurse);
if (err)
@@ -1507,6 +1532,77 @@ static int do_change_type(struct path *path, int flag)
}

/*
+ * Mount-time check of upper and lower layer file systems to see if we
+ * can union mount one on the other.
+ *
+ * Note on union mounts and mount event propagation: The lower
+ * layer(s) of a union mount must not have any changes to its
+ * namespace. Therefore, it must not be part of any mount event
+ * propagation group - i.e., shared or slave. MNT_SHARED and
+ * MNT_SLAVE are not set at mount, but in do_change_type(), which
+ * prevents setting these flags on file systems with read-only users,
+ * which includes the lower layer(s) of a union mount.
+ */
+
+static int
+check_union_mnt(struct path *mntpnt, struct vfsmount *topmost_mnt, int mnt_flags)
+{
+ struct vfsmount *lower_mnt = mntpnt->mnt;
+
+ if (!(mnt_flags & MNT_UNION))
+ return 0;
+
+#ifndef CONFIG_UNION_MOUNT
+ return -EINVAL;
+#endif
+ /*
+ * We can't deal with namespace changes in the lower layers of
+ * a union, so the lower layer must be read-only. Note that
+ * we could possibly convert a read-write unioned mount into a
+ * read-only mount here, which would give us a way to union
+ * more than one layer with separate mount commands. But
+ * first we have to solve the locking order problems with more
+ * than two layers of union.
+ */
+ if (!(lower_mnt->mnt_sb->s_flags & MS_RDONLY))
+ return -EBUSY;
+
+ /*
+ * WRITEME: For simplicity, the lower layer can't have
+ * submounts. If there's a good reason, we could recursively
+ * check the whole subtree for read-only-ness, etc. and it
+ * would probably work fine.
+ */
+ if (!list_empty(&lower_mnt->mnt_mounts))
+ return -EBUSY;
+
+ /*
+ * Only permit unioning of file systems at their root
+ * directories. This allows us to mark entire mounts as
+ * unioned. Otherwise we must slowly and expensively work our
+ * way up a path looking for a unioned directory before we
+ * know if a path is from a unioned lower layer.
+ */
+
+ if (!IS_ROOT(mntpnt->dentry))
+ return -EINVAL;
+
+ /*
+ * Topmost layer must be writable to support our readdir()
+ * solution of copying up all lower level entries to the
+ * topmost layer.
+ */
+ if (mnt_flags & MNT_READONLY)
+ return -EROFS;
+
+ /* Topmost file system must support whiteouts and fallthrus. */
+ if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT))
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
* do loopback mount.
*/
static int do_loopback(struct path *path, char *old_name,
@@ -1527,6 +1623,9 @@ static int do_loopback(struct path *path, char *old_name,
err = -EINVAL;
if (IS_MNT_UNBINDABLE(old_path.mnt))
goto out;
+ /* Mount part of a union mount elsewhere? The mind boggles. */
+ if (IS_MNT_UNION(old_path.mnt))
+ goto out;

if (!check_mnt(path->mnt) || !check_mnt(old_path.mnt))
goto out;
@@ -1548,7 +1647,6 @@ static int do_loopback(struct path *path, char *old_name,
spin_unlock(&vfsmount_lock);
release_mounts(&umount_list);
}
-
out:
up_write(&namespace_sem);
path_put(&old_path);
@@ -1589,6 +1687,17 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
if (!check_mnt(path->mnt))
return -EINVAL;

+ if (mnt_flags & MNT_UNION)
+ return -EINVAL;
+
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ !(mnt_flags & MNT_UNION))
+ return -EINVAL;
+
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ (mnt_flags & MNT_READONLY))
+ return -EINVAL;
+
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;

@@ -1641,6 +1750,9 @@ static int do_move_mount(struct path *path, char *old_name)
while (d_mountpoint(path->dentry) &&
follow_down(path))
;
+ /* Get the lowest layer of a union mount to move the whole stack */
+ while (union_down_one(&old_path.mnt, &old_path.dentry))
+ ;
err = -EINVAL;
if (!check_mnt(path->mnt) || !check_mnt(old_path.mnt))
goto out;
@@ -1753,10 +1865,18 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ err = check_union_mnt(path, newmnt, mnt_flags);
+ if (err)
+ goto unlock;
+
newmnt->mnt_flags = mnt_flags;
if ((err = graft_tree(newmnt, path)))
goto unlock;

+ /* Union mounts require the lower layer to always be read-only */
+ if (mnt_flags & MNT_UNION)
+ inc_hard_readonly_users(newmnt->mnt_parent);
+
if (fslist) /* add to the specified expiration list */
list_add_tail(&newmnt->mnt_expire, fslist);

@@ -2267,6 +2387,14 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
if (d_unlinked(old.dentry))
goto out2;
error = -EBUSY;
+ /*
+ * We want the bottom-most layer of a union mount here - if we
+ * move that around, all the layers on top move with it.
+ */
+ while (union_down_one(&new.mnt, &new.dentry))
+ ;
+ while (union_down_one(&root.mnt, &root.dentry))
+ ;
if (new.mnt == root.mnt ||
old.mnt == root.mnt)
goto out2; /* loop, on the same file system */
diff --git a/fs/union.c b/fs/union.c
index 5011d26..8ad9de7 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -114,6 +114,7 @@ static struct union_mount *union_alloc(struct path *upper, struct path *lower)

atomic_set(&um->u_count, 1);
INIT_LIST_HEAD(&um->u_unions);
+ INIT_LIST_HEAD(&um->u_list);
INIT_HLIST_NODE(&um->u_hash);
INIT_HLIST_NODE(&um->u_rhash);

@@ -274,6 +275,7 @@ int append_to_union(struct path *upper, struct path *lower)
union_put(new);
return 0;
}
+ list_add(&new->u_list, &upper->mnt->mnt_unions);
list_add(&new->u_unions, &upper->dentry->d_unions);
lower->dentry->d_unionized++;
__union_hash(new);
@@ -373,6 +375,7 @@ repeat:
list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
BUG_ON(!hlist_unhashed(&this->u_hash));
BUG_ON(!hlist_unhashed(&this->u_rhash));
+ list_del(&this->u_list);
list_del(&this->u_unions);
this->u_lower.dentry->d_unionized--;
spin_unlock(&union_lock);
@@ -383,6 +386,66 @@ repeat:
}

/*
+ * Remove all union_mount structures belonging to this vfsmount from the
+ * union lookup hashtable and so on ...
+ */
+void shrink_mnt_unions(struct vfsmount *mnt)
+{
+ struct union_mount *this, *next;
+
+repeat:
+ spin_lock(&union_lock);
+ list_for_each_entry_safe(this, next, &mnt->mnt_unions, u_list) {
+ if (this->u_upper.dentry == mnt->mnt_root)
+ continue;
+ __union_unhash(this);
+ list_del(&this->u_list);
+ list_del(&this->u_unions);
+ this->u_lower.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(this);
+ goto repeat;
+ }
+ spin_unlock(&union_lock);
+}
+
+int attach_mnt_union(struct vfsmount *upper_mnt, struct vfsmount *lower_mnt)
+{
+ struct path upper, lower;
+ if (!IS_MNT_UNION(upper_mnt))
+ return 0;
+
+ /* Make a union of the root dirs of the upper and lower mounts */
+ upper.mnt = upper_mnt;
+ upper.dentry = upper_mnt->mnt_root;
+
+ lower.mnt = lower_mnt;
+ lower.dentry = lower_mnt->mnt_root;
+
+ return append_to_union(&upper, &lower);
+}
+
+void detach_mnt_union(struct vfsmount *mnt)
+{
+ struct union_mount *um;
+
+ if (!IS_MNT_UNION(mnt))
+ return;
+
+ shrink_mnt_unions(mnt);
+
+ spin_lock(&union_lock);
+ um = union_cache_lookup(mnt->mnt_root, mnt);
+ __union_unhash(um);
+ list_del(&um->u_list);
+ list_del(&um->u_unions);
+ um->u_lower.dentry->d_unionized--;
+ spin_unlock(&union_lock);
+ union_put(um);
+ return;
+}
+
+/*
* union_create_topmost_dir - Create a matching dir in the topmost file system
*/

diff --git a/include/linux/union.h b/include/linux/union.h
index 681b472..189a84d 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -49,6 +49,8 @@ extern void __d_drop_unions(struct dentry *);
extern void shrink_d_unions(struct dentry *);
extern struct dentry * union_create_topmost_dir(struct path *, struct qstr *,
struct path *);
+extern int attach_mnt_union(struct vfsmount *, struct vfsmount *);
+extern void detach_mnt_union(struct vfsmount *);

#else /* CONFIG_UNION_MOUNT */

@@ -60,6 +62,8 @@ extern struct dentry * union_create_topmost_dir(struct path *, struct qstr *,
#define __d_drop_unions(x) do { } while (0)
#define shrink_d_unions(x) do { } while (0)
#define union_create_topmost_dir(x, y, z) ({ BUG(); (NULL); })
+#define attach_mnt_union(x, y) do { } while (0)
+#define detach_mnt_union(x) do { } while (0)

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-04-15 23:08:39

by Valerie Aurora

Subject: [PATCH 35/35] union-mount: Implement union-aware utimensat()

XXX - doesn't implement NOFOLLOW correctly
---
fs/utimes.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..82feca2 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -8,6 +8,7 @@
#include <linux/stat.h>
#include <linux/utime.h>
#include <linux/syscalls.h>
+#include <linux/union.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>

@@ -152,18 +153,26 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags
error = utimes_common(&file->f_path, times);
fput(file);
} else {
+ struct nameidata nd;
+ char *tmp;
struct path path;
int lookup_flags = 0;

if (!(flags & AT_SYMLINK_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;

- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_nd(dfd, filename, lookup_flags, &nd, &path,
+ &tmp);
if (error)
goto out;

- error = utimes_common(&path, times);
+ error = union_copyup(&nd, &path);
+
+ if (!error)
+ error = utimes_common(&path, times);
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
}

out:
--
1.6.3.3

2010-04-15 23:08:55

by Valerie Aurora

Subject: [PATCH 34/35] union-mount: Implement union-aware lchown()

---
fs/open.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 6ec99e9..dc65b27 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -812,18 +812,35 @@ out:
SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_lpath(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, 0, &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_UNIONED_DIR(&nd.path))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-04-15 23:08:59

by Valerie Aurora

Subject: [PATCH 33/35] union-mount: Implement union-aware chmod()/fchmodat()

---
fs/open.c | 25 +++++++++++++++++++++----
1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index dda1b6f..6ec99e9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -669,18 +669,32 @@ out:
SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;
struct iattr newattrs;

- error = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ error = user_path_nd(dfd, filename, LOOKUP_FOLLOW, &nd,
+ &path, &tmp);
if (error)
goto out;
- inode = path.dentry->d_inode;

- error = mnt_want_write(path.mnt);
+ if (IS_UNIONED_DIR(&nd.path))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ inode = path.dentry->d_inode;
mutex_lock(&inode->i_mutex);
error = security_path_chmod(path.dentry, path.mnt, mode);
if (error)
@@ -692,9 +706,12 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
error = notify_change(path.dentry, &newattrs);
out_unlock:
mutex_unlock(&inode->i_mutex);
- mnt_drop_write(path.mnt);
+mnt_drop_write_and_out:
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-04-15 23:09:32

by Valerie Aurora

Subject: [PATCH 32/35] union-mount: Implement union-aware truncate()

---
fs/open.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 325852d..dda1b6f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -230,14 +230,17 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
static long do_sys_truncate(const char __user *pathname, loff_t length)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;

error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;

- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
goto out;
inode = path.dentry->d_inode;
@@ -251,11 +254,16 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (!S_ISREG(inode->i_mode))
goto dput_and_out;

- error = mnt_want_write(path.mnt);
+ if (IS_UNIONED_DIR(&nd.path))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;

- error = inode_permission(inode, MAY_WRITE);
+ error = path_permission(&path, &nd.path, MAY_WRITE);
if (error)
goto mnt_drop_write_and_out;

@@ -263,6 +271,12 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (IS_APPEND(inode))
goto mnt_drop_write_and_out;

+ error = union_copyup_len(&nd, &path, length);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ /* path may have changed after copyup */
+ inode = path.dentry->d_inode;
error = get_write_access(inode);
if (error)
goto mnt_drop_write_and_out;
@@ -284,9 +298,11 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
put_write_and_out:
put_write_access(inode);
mnt_drop_write_and_out:
- mnt_drop_write(path.mnt);
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-04-15 23:09:49

by Valerie Aurora

Subject: [PATCH 31/35] union-mount: Implement union-aware chown()

Proof-of-concept implementation of chown() for union mounts.
---
fs/open.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 686fcd2..325852d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -30,6 +30,7 @@
#include <linux/falloc.h>
#include <linux/fs_struct.h>
#include <linux/ima.h>
+#include <linux/union.h>

#include "internal.h"

@@ -717,18 +718,35 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_path(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_UNIONED_DIR(&nd.path))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-04-15 23:10:17

by Valerie Aurora

Subject: [PATCH 28/35] union-mount: Implement union-aware link()

---
fs/namei.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 68aa8ab..5f6dcd4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3019,16 +3019,18 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct nameidata nd;
+ struct nameidata old_nd;
struct path old_path;
int error;
char *to;
+ char *from;

if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
return -EINVAL;

- error = user_path_at(olddfd, oldname,
+ error = user_path_nd(olddfd, oldname,
flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
- &old_path);
+ &old_nd, &old_path, &from);
if (error)
return error;

@@ -3036,8 +3038,20 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out;
error = -EXDEV;
- if (old_path.mnt != nd.path.mnt)
- goto out_release;
+ if (old_path.mnt != nd.path.mnt) {
+ if (IS_UNIONED_DIR(&old_nd.path) &&
+ (old_nd.path.mnt == nd.path.mnt)) {
+ error = mnt_want_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ error = union_copyup(&old_nd, &old_path);
+ mnt_drop_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ } else {
+ goto out_release;
+ }
+ }
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
@@ -3060,6 +3074,8 @@ out_release:
putname(to);
out:
path_put(&old_path);
+ path_put(&old_nd.path);
+ putname(from);

return error;
}
--
1.6.3.3

2010-04-15 23:10:11

by Valerie Aurora

Subject: [PATCH 29/35] union-mount: Implement union-aware rename()

On rename() of a file on a union mount, copy up the source file and
white it out. Both steps are done under the rename mutex, so I believe
the operation is actually atomic.

XXX - May not need to do file copyup under the lock.
---
fs/namei.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 70 insertions(+), 5 deletions(-)
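
Not part of the patch: a minimal userspace sketch of the sequence the
description above names (copy up the source, rename on the topmost
layer, then white out the source). It assumes a two-layer union with a
fixed set of integer "names"; every identifier below (model_union_dir,
model_lookup, model_rename, ...) is hypothetical and exists only for
illustration. Locking, directories, and error unwinding are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum model_entry { ENT_NONE, ENT_FILE, ENT_WHITEOUT };

#define MODEL_NAMES 4

struct model_union_dir {
	enum model_entry top[MODEL_NAMES];	/* writable topmost layer */
	enum model_entry lower[MODEL_NAMES];	/* read-only layer: NONE or FILE */
};

/* Return 1 if the name is visible through the union, 0 otherwise. */
int model_lookup(const struct model_union_dir *d, int name)
{
	assert(name >= 0 && name < MODEL_NAMES);
	if (d->top[name] == ENT_FILE)
		return 1;
	if (d->top[name] == ENT_WHITEOUT)
		return 0;	/* a whiteout masks the lower entry */
	return d->lower[name] == ENT_FILE;
}

/* Rename "from" to "to" inside one unioned directory. */
int model_rename(struct model_union_dir *d, int from, int to)
{
	assert(to >= 0 && to < MODEL_NAMES);
	if (!model_lookup(d, from))
		return -ENOENT;
	if (from == to)
		return 0;
	/* Step 1: copy up; a no-op if the source is already on top. */
	d->top[from] = ENT_FILE;
	/* Step 2: rename on the topmost layer; the new entry masks
	 * any lower-layer entry or whiteout at the target name. */
	d->top[to] = ENT_FILE;
	d->top[from] = ENT_NONE;
	/* Step 3: white out the source if a lower entry would show. */
	if (d->lower[from] == ENT_FILE)
		d->top[from] = ENT_WHITEOUT;
	return 0;
}

/* Demo: rename a file that lives only on the lower layer. Returns 1
 * if the old name is hidden and the new name is visible afterwards. */
int model_demo_rename_from_lower(void)
{
	struct model_union_dir d = { { ENT_NONE }, { ENT_NONE } };

	d.lower[0] = ENT_FILE;
	if (model_rename(&d, 0, 1) != 0)
		return 0;
	return !model_lookup(&d, 0) && model_lookup(&d, 1);
}

/* Demo: renaming a name that exists on neither layer. */
int model_demo_rename_missing(void)
{
	struct model_union_dir d = { { ENT_NONE }, { ENT_NONE } };

	return model_rename(&d, 2, 3);
}
```

Without step 3 the old name would reappear from the lower layer as soon
as the topmost entry moved away, which is why the whiteout belongs to
the same critical section as the rename in this model.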

diff --git a/fs/namei.c b/fs/namei.c
index 5f6dcd4..a6f7d5d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3233,6 +3233,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
{
struct dentry *old_dir, *new_dir;
struct path old, new;
+ struct path to_whiteout = {NULL, NULL};
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -3248,12 +3249,9 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit1;

error = -EXDEV;
+ /* Union mounts will pass below test - dirs always on topmost */
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
- /* Rename on union mounts not implemented yet */
- /* XXX much harsher check than necessary - can do some renames */
- if (IS_UNIONED_DIR(&oldnd.path) || IS_UNIONED_DIR(&newnd.path))
- goto exit2;
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
@@ -3276,7 +3274,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOENT;
if (!old.dentry->d_inode)
goto exit4;
- /* unless the source is a directory trailing slashes give -ENOTDIR */
+ /* unless the source is a directory, trailing slashes give -ENOTDIR */
if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
@@ -3288,6 +3286,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
+ error = -EXDEV;
+ /* Can't rename a directory from a lower layer */
+ if (IS_UNIONED_DIR(&oldnd.path) &&
+ IS_UNIONED_DIR(&old))
+ goto exit4;
error = lookup_hash(&newnd, &newnd.last, &new);
if (error)
goto exit4;
@@ -3295,6 +3298,48 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOTEMPTY;
if (new.dentry == trap)
goto exit5;
+ error = -EXDEV;
+ /* Can't rename over directories on the lower layer */
+ if (IS_UNIONED_DIR(&newnd.path) &&
+ IS_UNIONED_DIR(&new))
+ goto exit5;
+
+ /* If source is on lower layer, copy up */
+ if (IS_UNIONED_DIR(&oldnd.path) &&
+ (old.mnt != oldnd.path.mnt)) {
+ /* Save the lower path to avoid a second lookup for whiteout */
+ to_whiteout.dentry = dget(old.dentry);
+ to_whiteout.mnt = mntget(old.mnt);
+ error = __union_copyup(&oldnd, &old);
+ if (error)
+ goto exit5;
+ }
+
+ /* If target is on lower layer, get negative dentry for topmost */
+ if (IS_UNIONED_DIR(&newnd.path) &&
+ (new.mnt != newnd.path.mnt)) {
+ struct dentry *dentry;
+ /*
+ * At this point, source and target are both files,
+ * the source is on the topmost layer, and the target
+ * is on a lower layer. We want the target dentry to
+ * disappear from the namespace, and give vfs_rename a
+ * negative dentry from the topmost layer.
+ */
+ /* We already did lookup once, no need to check perm */
+ dentry = __lookup_hash(&newnd.last, newnd.path.dentry, &newnd);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto exit5;
+ }
+ /* We no longer need the lower target dentry. It
+ * definitely should be removed from the hash table */
+ /* XXX what about failure case? */
+ d_delete(new.dentry);
+ mntput(new.mnt);
+ new.mnt = mntget(newnd.path.mnt);
+ new.dentry = dentry;
+ }

error = mnt_want_write(oldnd.path.mnt);
if (error)
@@ -3305,6 +3350,26 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit6;
error = vfs_rename(old_dir->d_inode, old.dentry,
new_dir->d_inode, new.dentry);
+ if (error)
+ goto exit6;
+ /* Now whiteout the source */
+ if (IS_UNIONED_DIR(&oldnd.path)) {
+ if (!to_whiteout.dentry) {
+ struct dentry *dentry;
+ /* We could have exposed a lower level entry */
+ dentry = __lookup_hash(&oldnd.last, oldnd.path.dentry, &oldnd);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto exit6;
+ }
+ to_whiteout.dentry = dentry;
+ to_whiteout.mnt = mntget(oldnd.path.mnt);
+ }
+
+ if (to_whiteout.dentry->d_inode)
+ error = do_whiteout(&oldnd, &to_whiteout, 0);
+ path_put(&to_whiteout);
+ }
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
--
1.6.3.3

2010-04-15 23:10:54

by Valerie Aurora

Subject: [PATCH 26/35] union-mount: In-kernel copyup routines

When a file on the read-only layer of a union mount is altered, it
must be copied up to the topmost read-write layer. This patch creates
union_copyup() and its supporting routines.
---
fs/union.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/union.h | 7 +-
2 files changed, 252 insertions(+), 1 deletions(-)
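
Not part of the patch: a small userspace sketch of the decision that
do_union_copyup_len() makes before doing any work, as I read the code
below. Copyup is skipped unless the parent directory is unioned, the
target lives on a different mount than the parent, and the target is a
regular file or a symlink. The names model_type and model_needs_copyup
and the integer mount ids are hypothetical stand-ins, not kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the file type bits in i_mode. */
enum model_type { MODEL_REG, MODEL_LNK, MODEL_DIR, MODEL_OTHER };

/*
 * Decide whether a copyup is needed. Mount ids stand in for
 * struct vfsmount pointers; equal ids mean "same mount".
 */
bool model_needs_copyup(bool parent_is_unioned, int parent_mnt,
			int target_mnt, enum model_type type)
{
	assert(parent_mnt >= 0 && target_mnt >= 0);

	if (!parent_is_unioned)
		return false;	/* not a union mount at all */
	if (parent_mnt == target_mnt)
		return false;	/* target already on the topmost layer */
	if (type != MODEL_REG && type != MODEL_LNK)
		return false;	/* only files and symlinks are copied up */
	return true;
}
```

All three early returns report success (0) in the patch, which is what
lets callers such as chown() call union_copyup() unconditionally.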

diff --git a/fs/union.c b/fs/union.c
index e2384ad..944c720 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -26,6 +26,7 @@
#include <linux/namei.h>
#include <linux/file.h>
#include <linux/security.h>
+#include <linux/splice.h>

/*
* This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -633,3 +634,248 @@ out_fput:
mnt_drop_write(topmost_path->mnt);
return res;
}
+
+/**
+ * union_create_file
+ *
+ * @nd: nameidata for source file
+ * @old: path of the source file
+ * @new: negative dentry for the new file
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's
+ * i_mutex.
+ */
+
+static int union_create_file(struct nameidata *nd, struct path *old,
+ struct dentry *new)
+{
+ struct path *parent = &nd->path;
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ return vfs_create(parent->dentry->d_inode, new,
+ old->dentry->d_inode->i_mode, nd);
+}
+
+/**
+ * union_create_symlink
+ *
+ * @nd: nameidata for source symlink
+ * @old: path of the source symlink
+ * @new: negative dentry for the new symlink
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's
+ * i_mutex.
+ */
+
+static int union_create_symlink(struct nameidata *nd, struct path *old,
+ struct dentry *new)
+{
+ void *cookie;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&nd->path.dentry->d_inode->i_mutex));
+
+ printk(KERN_INFO "%s: copying up symlink\n", new->d_name.name);
+ /*
+ * We want the contents of this symlink, not to follow it, so
+ * this is modeled on generic_readlink() rather than
+ * do_follow_link().
+ */
+ nd->depth = 0;
+ cookie = old->dentry->d_inode->i_op->follow_link(old->dentry, nd);
+ if (IS_ERR(cookie))
+ return PTR_ERR(cookie);
+ /* Create a copy of the link on the top layer */
+ error = vfs_symlink(nd->path.dentry->d_inode, new,
+ nd_get_link(nd));
+ if (old->dentry->d_inode->i_op->put_link)
+ old->dentry->d_inode->i_op->put_link(old->dentry, nd, cookie);
+ return error;
+}
+
+/**
+ * union_copyup_data - Copy up len bytes of old's data to new
+ * @old: path of the source file
+ * @new_mnt: mount of the target file
+ * @new_dentry: dentry of the target file
+ * @len: number of bytes to copy
+ */
+
+static int union_copyup_data(struct path *old, struct vfsmount *new_mnt,
+ struct dentry *new_dentry, size_t len)
+{
+ struct file *old_file;
+ struct file *new_file;
+ const struct cred *cred = current_cred();
+ loff_t offset = 0;
+ long bytes;
+ int error = 0;
+
+ if (len == 0)
+ return 0;
+
+ /* Get reference to balance later fput() */
+ path_get(old);
+ old_file = dentry_open(old->dentry, old->mnt, O_RDONLY, cred);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ dget(new_dentry);
+ mntget(new_mnt);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY, cred);
+ if (IS_ERR(new_file)) {
+ error = PTR_ERR(new_file);
+ goto out_fput;
+ }
+
+ bytes = do_splice_direct(old_file, &offset, new_file, len,
+ SPLICE_F_MOVE);
+ if (bytes < 0)
+ error = bytes;
+
+ fput(new_file);
+out_fput:
+ fput(old_file);
+ return error;
+}
+
+/**
+ * __union_copyup_len - Copy up a file and len bytes of data
+ *
+ * @nd: nameidata of the parent directory
+ * @path: path of file to be copied up
+ * @len: number of bytes of file data to copy up
+ *
+ * Parent's i_mutex must be held by caller. Newly copied up path is
+ * returned in @path and original is path_put().
+ */
+
+static int __union_copyup_len(struct nameidata *nd, struct path *path,
+ size_t len)
+{
+ struct path *parent = &nd->path;
+ struct dentry *dentry;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ dentry = lookup_one_len(path->dentry->d_name.name, parent->dentry,
+ path->dentry->d_name.len);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ if (dentry->d_inode) {
+ /*
+ * We raced with someone else and "lost." That's
+ * okay, they did all the work of copying up the file.
+ * Note that currently data copyup happens under the
+ * parent dir's i_mutex. If we move it outside that,
+ * we'll need some way of waiting for the data copyup
+ * to complete here.
+ */
+ error = 0;
+ goto out_newpath;
+ }
+ if (S_ISREG(path->dentry->d_inode->i_mode)) {
+ /* Create file */
+ error = union_create_file(nd, path, dentry);
+ if (error)
+ goto out_dput;
+ /* Copyup data */
+ error = union_copyup_data(path, parent->mnt, dentry, len);
+ } else {
+ BUG_ON(!S_ISLNK(path->dentry->d_inode->i_mode));
+ error = union_create_symlink(nd, path, dentry);
+ }
+ if (error) {
+ /* Most likely error: ENOSPC */
+ vfs_unlink(parent->dentry->d_inode, dentry);
+ goto out_dput;
+ }
+ /* XXX Copyup xattrs and any other dangly bits */
+out_newpath:
+ /* path_put() of original must happen before we copy in new */
+ path_put(path);
+ path->dentry = dentry;
+ path->mnt = mntget(parent->mnt);
+ return error;
+out_dput:
+ /* Don't path_put(path), let caller unwind */
+ dput(dentry);
+ return error;
+}
+
+/**
+ * do_union_copyup_len - Copy up a file given its path (and its parent's)
+ *
+ * @nd: nameidata of the parent directory
+ * @path: path of file to be copied up; after a successful copyup it
+ *        is replaced by the path of the copy on the topmost layer
+ * @copy_all: if set, copy all of the file's data and ignore @len
+ * @len: if @copy_all is not set, number of bytes of file data to copy up
+ */
+
+int do_union_copyup_len(struct nameidata *nd, struct path *path, int copy_all,
+ size_t len)
+{
+ struct path *parent = &nd->path;
+ int error;
+
+ if (!IS_UNIONED_DIR(parent))
+ return 0;
+ if (parent->mnt == path->mnt)
+ return 0;
+ if (!S_ISREG(path->dentry->d_inode->i_mode) &&
+ !S_ISLNK(path->dentry->d_inode->i_mode))
+ return 0;
+
+ BUG_ON(!S_ISDIR(parent->dentry->d_inode->i_mode));
+
+ mutex_lock(&parent->dentry->d_inode->i_mutex);
+ error = -ENOENT;
+ if (IS_DEADDIR(parent->dentry->d_inode))
+ goto out_unlock;
+
+ if (copy_all && S_ISREG(path->dentry->d_inode->i_mode)) {
+ error = -EFBIG;
+ len = i_size_read(path->dentry->d_inode);
+ if (((size_t)len != len) || ((ssize_t)len != len))
+ goto out_unlock;
+ }
+
+ error = __union_copyup_len(nd, path, len);
+
+out_unlock:
+ mutex_unlock(&parent->dentry->d_inode->i_mutex);
+ return error;
+}
+
+/*
+ * Helper function to copy up all of a file
+ */
+int union_copyup(struct nameidata *nd, struct path *path)
+{
+ return do_union_copyup_len(nd, path, 1, 0);
+}
+
+/*
+ * Unlocked helper function to copy up all of a file
+ */
+int __union_copyup(struct nameidata *nd, struct path *path)
+{
+ loff_t len;
+ len = i_size_read(path->dentry->d_inode);
+ if (((size_t)len != len) || ((ssize_t)len != len))
+ return -EFBIG;
+
+ return __union_copyup_len(nd, path, len);
+}
+
+/*
+ * Helper function to copy up part of a file
+ */
+int union_copyup_len(struct nameidata *nd, struct path *path, size_t len)
+{
+ return do_union_copyup_len(nd, path, 0, len);
+}
+
diff --git a/include/linux/union.h b/include/linux/union.h
index 66deeb2..21254c6 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -52,7 +52,9 @@ extern struct dentry * union_create_topmost_dir(struct path *, struct qstr *,
extern int attach_mnt_union(struct vfsmount *, struct vfsmount *);
extern void detach_mnt_union(struct vfsmount *);
extern int union_copyup_dir(struct path *);
-
+extern int union_copyup(struct nameidata *, struct path *);
+extern int __union_copyup(struct nameidata *, struct path *);
+extern int union_copyup_len(struct nameidata *, struct path *, size_t len);
#else /* CONFIG_UNION_MOUNT */

#define IS_MNT_UNION(x) (0)
@@ -66,6 +68,9 @@ extern int union_copyup_dir(struct path *);
#define attach_mnt_union(x, y) do { } while (0)
#define detach_mnt_union(x) do { } while (0)
#define union_copyup_dir(x) ({ BUG(); (0); })
+#define union_copyup(x, y) ({ BUG(); (0); })
+#define __union_copyup(x, y) ({ BUG(); (0); })
+#define union_copyup_len(x, y, z) ({ BUG(); (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-04-15 23:11:26

by Valerie Aurora

Subject: [PATCH 25/35] VFS: Create user_path_nd() to lookup both parent and target

Proof-of-concept implementation of user_path_nd(). Look up both the
parent and the target of a user-supplied filename, so that both can be
passed to the union copyup routines later.
---
fs/namei.c | 31 +++++++++++++++++++++++++++++++
include/linux/namei.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 24e0cb2..68aa8ab 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1563,6 +1563,37 @@ static int user_path_parent(int dfd, const char __user *path,
return error;
}

+int user_path_nd(int dfd, const char __user *filename,
+ unsigned flags, struct nameidata *parent_nd,
+ struct path *child, char **tmp)
+{
+ struct nameidata child_nd;
+ char *s = getname(filename);
+ int error;
+
+ if (IS_ERR(s))
+ return PTR_ERR(s);
+
+ /* Lookup parent */
+ error = do_path_lookup(dfd, s, LOOKUP_PARENT, parent_nd);
+ if (error)
+ goto out_putname;
+
+ /* Lookup child - XXX optimize, racy */
+ error = do_path_lookup(dfd, s, flags, &child_nd);
+ if (error)
+ goto out_path_put;
+ *child = child_nd.path;
+ *tmp = s;
+ return 0;
+
+out_path_put:
+ path_put(&parent_nd->path);
+out_putname:
+ putname(s);
+ return error;
+}
+
/*
* It's inline, so penalty for filesystems that don't use sticky bit is
* minimal.
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..83dc8b5 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -58,6 +58,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_RENAME_TARGET 0x0800

extern int user_path_at(int, const char __user *, unsigned, struct path *);
+extern int user_path_nd(int, const char __user *, unsigned,
+ struct nameidata *, struct path *, char **);

#define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
#define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
--
1.6.3.3

2010-04-15 23:11:41

by Valerie Aurora

Subject: [PATCH 24/35] VFS: Split inode_permission() and create path_permission()

Split inode_permission() into inode and file-system-dependent parts.
Create path_permission() to check permission based on the path to the
inode. This is for union mounts, in which an inode can be located on
a read-only lower layer file system but is still writable, since we
will copy it up to the writable top layer file system. So in that
case, we want to ignore MS_RDONLY on the lower layer. To make this
decision, we must know the path (vfsmount, dentry) of both the target
and its parent.
---
fs/namei.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--------
include/linux/fs.h | 1 +
2 files changed, 79 insertions(+), 14 deletions(-)
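
Not part of the patch: a userspace sketch of the rule described above,
under the assumption that the only file-system-wide check is the
read-only flag. When the parent's mount is a union mount, the read-only
test is applied to the parent's (topmost) superblock instead of the
superblock the file currently lives on. All model_* names and
MODEL_MAY_WRITE are hypothetical; real inode permission checks (mode
bits, ACLs, security hooks) are reduced here to a single immutable flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAY_WRITE 0x2	/* hypothetical stand-in for MAY_WRITE */

struct model_sb   { bool rdonly; };
struct model_mnt  { const struct model_sb *sb; bool is_union; };
struct model_path { const struct model_mnt *mnt; bool is_dir; bool immutable; };

/* File-system-wide part: nobody writes to a read-only fs. */
static int model_sb_permission(const struct model_sb *sb, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && sb->rdonly)
		return -EROFS;
	return 0;
}

/* Inode-specific part, reduced to the immutable check. */
static int model_inode_permission(const struct model_path *path, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && path->immutable)
		return -EACCES;
	return 0;
}

int model_path_permission(const struct model_path *path,
			  const struct model_path *parent, int mask)
{
	const struct model_mnt *mnt;
	int ret;

	assert(parent->is_dir);	/* catches reversed arguments */

	/* In a union, the topmost (parent's) mount decides writability. */
	mnt = parent->mnt->is_union ? parent->mnt : path->mnt;

	ret = model_sb_permission(mnt->sb, mask);
	if (ret)
		return ret;
	return model_inode_permission(path, mask);
}

/* Demo: a file on a read-only lower layer under a given parent mount. */
int model_demo_write_check(bool parent_is_union, bool top_rdonly)
{
	struct model_sb top_sb = { top_rdonly };
	struct model_sb lower_sb = { true };
	struct model_mnt top = { &top_sb, parent_is_union };
	struct model_mnt lower = { &lower_sb, false };
	struct model_path parent = { &top, true, false };
	struct model_path file = { &lower, false, false };

	return model_path_permission(&file, &parent, MODEL_MAY_WRITE);
}
```

In this model the lower layer's read-only flag is ignored only when the
parent mount is a union; a read-only topmost layer still yields -EROFS.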

diff --git a/fs/namei.c b/fs/namei.c
index 900df0f..24e0cb2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -241,29 +241,20 @@ int generic_permission(struct inode *inode, int mask,
}

/**
- * inode_permission - check for access rights to a given inode
+ * __inode_permission - check for access rights to a given inode
* @inode: inode to check permission on
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
* Used to check for read/write/execute permissions on an inode.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things.
+ *
+ * This does not check for a read-only file system. You probably want
+ * inode_permission().
*/
-int inode_permission(struct inode *inode, int mask)
+static int __inode_permission(struct inode *inode, int mask)
{
int retval;

if (mask & MAY_WRITE) {
- umode_t mode = inode->i_mode;
-
- /*
- * Nobody gets write access to a read-only fs.
- */
- if (IS_RDONLY(inode) &&
- (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
- return -EROFS;
-
/*
* Nobody gets write access to an immutable file.
*/
@@ -288,6 +279,79 @@ int inode_permission(struct inode *inode, int mask)
}

/**
+ * sb_permission - check superblock-level permissions
+ * @sb: superblock of inode to check permission on
+ * @inode: inode to check permission on (only its file type is used)
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * File-system-wide checks, kept apart from the inode-specific ones so
+ * that union mounts can test the read-only status of the topmost fs.
+ */
+int sb_permission(struct super_block *sb, struct inode *inode, int mask)
+{
+ if (mask & MAY_WRITE) {
+ umode_t mode = inode->i_mode;
+
+ /*
+ * Nobody gets write access to a read-only fs.
+ */
+ if ((sb->s_flags & MS_RDONLY) &&
+ (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+ return -EROFS;
+ }
+ return 0;
+}
+
+/**
+ * inode_permission - check for access rights to a given inode
+ * @inode: inode to check permission on
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Used to check for read/write/execute permissions on an inode.
+ * We use "fsuid" for this, letting us set arbitrary permissions
+ * for filesystem access without changing the "normal" uids which
+ * are used for other things.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+ int retval;
+
+ retval = sb_permission(inode->i_sb, inode, mask);
+ if (retval)
+ return retval;
+ return __inode_permission(inode, mask);
+}
+
+/**
+ * path_permission - check for inode access rights depending on path
+ * @path: path of inode to check
+ * @parent_path: path of inode's parent
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Like inode_permission, but used to check for permission when the
+ * file may potentially be copied up between union layers.
+ */
+
+int path_permission(struct path *path, struct path *parent_path, int mask)
+{
+ struct vfsmount *mnt;
+ int retval;
+
+ /* Catch some reversal of args */
+ BUG_ON(!S_ISDIR(parent_path->dentry->d_inode->i_mode));
+
+ if (IS_MNT_UNION(parent_path->mnt))
+ mnt = parent_path->mnt;
+ else
+ mnt = path->mnt;
+
+ retval = sb_permission(mnt->mnt_sb, path->dentry->d_inode, mask);
+ if (retval)
+ return retval;
+ return __inode_permission(path->dentry->d_inode, mask);
+}
+
+/**
* file_permission - check for additional access rights to a given file
* @file: file to check access rights for
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4dae882..0b1c811 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2114,6 +2114,7 @@ extern sector_t bmap(struct inode *, sector_t);
#endif
extern int notify_change(struct dentry *, struct iattr *);
extern int inode_permission(struct inode *, int);
+extern int path_permission(struct path *, struct path *, int);
extern int generic_permission(struct inode *, int,
int (*check_acl)(struct inode *, int));

--
1.6.3.3

2010-04-15 23:11:56

by Valerie Aurora

Subject: [PATCH 23/35] union-mount: Copy up directory entries on first readdir()

readdir() in union mounts is implemented by copying up all visible
directory entries from the lower level directories to the topmost
directory. Directory entries that refer to lower level file system
objects are marked as "fallthru" in the topmost directory.

Thanks to Felix Fietkau <[email protected]> for a bug fix.

Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Felix Fietkau <[email protected]>
---
fs/readdir.c | 9 +++
fs/union.c | 160 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/union.h | 2 +
3 files changed, 171 insertions(+), 0 deletions(-)
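
Not part of the patch: a userspace sketch of the per-entry rule that
union_copyup_dir_one() applies, as I read the code below. "." and ".."
are skipped, any existing topmost entry (file, whiteout, or fallthru)
wins, and only a negative topmost entry is turned into a fallthru. The
enum and function names are hypothetical; the real code asks the
topmost file system to store the fallthru via its ->fallthru() op.

```c
#include <assert.h>

/* Hypothetical state of a name in the topmost directory. */
enum model_top_entry { TOP_NEGATIVE, TOP_FILE, TOP_WHITEOUT, TOP_FALLTHRU };

/* Return 1 for "." and "..", 0 for every other name. */
int model_is_dot_or_dotdot(const char *name, int namlen)
{
	assert(name && namlen > 0);

	if (namlen == 1)
		return name[0] == '.';
	if (namlen == 2)
		return name[0] == '.' && name[1] == '.';
	return 0;
}

/*
 * Given the current topmost state for a name seen in a lower
 * directory, return the topmost state after copyup of that entry.
 */
enum model_top_entry model_copyup_dir_one(enum model_top_entry top,
					  const char *name, int namlen)
{
	if (model_is_dot_or_dotdot(name, namlen))
		return top;		/* never copied up */
	if (top != TOP_NEGATIVE)
		return top;		/* topmost entry masks the lower one */
	return TOP_FALLTHRU;		/* mark as "look in the lower layer" */
}
```

Because the directory is marked opaque before the walk, later lookups
in this model reach the lower layer only through fallthru entries.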

diff --git a/fs/readdir.c b/fs/readdir.c
index 3a48491..da71515 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,6 +16,8 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/unistd.h>
+#include <linux/union.h>
+#include <linux/mount.h>

#include <asm/uaccess.h>

@@ -36,9 +38,16 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
+ if (IS_UNIONED_DIR(&file->f_path) && !IS_OPAQUE(inode)) {
+ res = union_copyup_dir(&file->f_path);
+ if (res)
+ goto out_unlock;
+ }
+
res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
+out_unlock:
mutex_unlock(&inode->i_mutex);
out:
return res;
diff --git a/fs/union.c b/fs/union.c
index 8ad9de7..e2384ad 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -5,6 +5,7 @@
* Copyright (C) 2007-2009 Novell Inc.
*
* Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the Free
@@ -23,6 +24,8 @@
#include <linux/slab.h>
#include <linux/union.h>
#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/security.h>

/*
* This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -473,3 +476,160 @@ out:
mnt_drop_write(parent->mnt);
return dentry;
}
+
+/**
+ * union_copyup_dir_one - copy up a single directory entry
+ *
+ * Individual directory entry copyup function for union_copyup_dir.
+ * We get the entries from higher level layers first.
+ */
+
+static int union_copyup_dir_one(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct dentry *topmost_dentry = (struct dentry *) buf;
+ struct dentry *dentry;
+ int err = 0;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ /* Lookup this entry in the topmost directory */
+ dentry = lookup_one_len(name, topmost_dentry, namlen);
+
+ if (IS_ERR(dentry)) {
+ printk(KERN_WARNING "%s: error looking up %.*s\n", __func__,
+ namlen, name);
+ err = PTR_ERR(dentry);
+ goto out;
+ }
+
+ /*
+ * If the entry already exists, one of the following is true:
+ * it was already copied up (due to an earlier lookup), an
+ * entry with the same name already exists on the topmost file
+ * system, it is a whiteout, or it is a fallthru. In each
+ * case, the top level entry masks any entries from lower file
+ * systems, so don't copy up this entry.
+ */
+ if (dentry->d_inode || d_is_whiteout(dentry) || d_is_fallthru(dentry))
+ goto out_dput;
+
+ /*
+ * If the entry doesn't exist, create a fallthru entry in the
+ * topmost file system. All possible directory types are
+ * used, so each file system must implement its own way of
+ * storing a fallthru entry.
+ */
+ err = topmost_dentry->d_inode->i_op->fallthru(topmost_dentry->d_inode,
+ dentry);
+out_dput:
+ dput(dentry);
+out:
+ return err;
+}
+
+/**
+ * union_copyup_dir - copy up low-level directory entries to topmost dir
+ *
+ * readdir() is difficult to support on union file systems for two
+ * reasons: We must eliminate duplicates and apply whiteouts, and we
+ * must return something in f_pos that lets us restart in the same
+ * place when we return. Our solution is to, on first readdir() of
+ * the directory, copy up all visible entries from the low-level file
+ * systems and mark the entries that refer to low-level file system
+ * objects as "fallthru" entries.
+ *
+ * Locking strategy: We hold the topmost dir's i_mutex on entry. We
+ * grab the i_mutex on lower directories one by one. So the locking
+ * order is:
+ *
+ * Writable/topmost layers > Read-only/lower layers
+ *
+ * So there is no problem with lock ordering for union stacks with
+ * multiple lower layers. E.g.:
+ *
+ * (topmost) A->B->C (bottom)
+ * (topmost) D->C->B (bottom)
+ *
+ * (Not that we support more than two layers at the moment.)
+ */
+
+int union_copyup_dir(struct path *topmost_path)
+{
+ struct dentry *topmost_dentry = topmost_path->dentry;
+ struct path path = *topmost_path;
+ int res = 0;
+
+ BUG_ON(IS_OPAQUE(topmost_dentry->d_inode));
+
+ res = mnt_want_write(topmost_path->mnt);
+ if (res)
+ return res;
+ /*
+ * Mark this dir opaque to show that we have already copied up
+ * the lower entries. Only fallthru entries pass through to
+ * the underlying file system.
+ */
+ topmost_dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(topmost_dentry->d_inode);
+
+ path_get(&path);
+ while (union_down_one(&path.mnt, &path.dentry)) {
+ struct file * ftmp;
+ struct inode * inode;
+
+ /* XXX Permit fallthrus on lower-level? Would need to
+ * pass in opaque flag to union_copyup_dir_one() and
+ * only copy up fallthru entries there. We allow
+ * fallthrus in lower level opaque directories on
+ * lookup, so for consistency we should do one or the
+ * other in both places. */
+ if (IS_OPAQUE(path.dentry->d_inode))
+ break;
+
+ /* dentry_open() doesn't get a path reference itself */
+ path_get(&path);
+ ftmp = dentry_open(path.dentry, path.mnt,
+ O_RDONLY | O_DIRECTORY | O_NOATIME,
+ current_cred());
+ if (IS_ERR(ftmp)) {
+ printk (KERN_ERR "unable to open dir %s for "
+ "directory copyup: %ld\n",
+ path.dentry->d_name.name, PTR_ERR(ftmp));
+ path_put(&path);
+ continue;
+ }
+
+ inode = path.dentry->d_inode;
+ mutex_lock(&inode->i_mutex);
+
+ res = -ENOENT;
+ if (IS_DEADDIR(inode))
+ goto out_fput;
+ /*
+ * Read the whole directory, calling our directory
+ * entry copyup function on each entry. Pass in the
+ * topmost dentry as our private data so we can create
+ * new entries in the topmost directory.
+ */
+ res = ftmp->f_op->readdir(ftmp, topmost_dentry,
+ union_copyup_dir_one);
+out_fput:
+ mutex_unlock(&inode->i_mutex);
+ fput(ftmp);
+
+ if (res)
+ break;
+ }
+ path_put(&path);
+ mnt_drop_write(topmost_path->mnt);
+ return res;
+}
diff --git a/include/linux/union.h b/include/linux/union.h
index 189a84d..66deeb2 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -51,6 +51,7 @@ extern struct dentry * union_create_topmost_dir(struct path *, struct qstr *,
struct path *);
extern int attach_mnt_union(struct vfsmount *, struct vfsmount *);
extern void detach_mnt_union(struct vfsmount *);
+extern int union_copyup_dir(struct path *);

#else /* CONFIG_UNION_MOUNT */

@@ -64,6 +65,7 @@ extern void detach_mnt_union(struct vfsmount *);
#define union_create_topmost_dir(x, y, z) ({ BUG(); (NULL); })
#define attach_mnt_union(x, y) do { } while (0)
#define detach_mnt_union(x) do { } while (0)
+#define union_copyup_dir(x) ({ BUG(); (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3
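
The copy-up strategy described in the union_copyup_dir() comment above can be modeled in userspace. This is only a sketch under stated assumptions: model_dir, model_entry, model_find() and model_copyup() are hypothetical names, directories are small fixed arrays, and the lower directory is assumed to hold only regular entries. It shows the one rule the patch states explicitly: a lower name gets a fallthru entry in the topmost directory only if the topmost directory has no entry of any kind (including a whiteout) under that name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model, not the kernel data structures. */
enum model_kind { MODEL_REGULAR, MODEL_WHITEOUT, MODEL_FALLTHRU };

struct model_entry {
	const char *name;
	enum model_kind kind;
};

#define MODEL_MAX 16

struct model_dir {
	struct model_entry e[MODEL_MAX];
	size_t n;
	int opaque;
};

/* Return the entry called @name in @d, or NULL if there is none. */
static const struct model_entry *model_find(const struct model_dir *d,
					    const char *name)
{
	for (size_t i = 0; i < d->n; i++)
		if (strcmp(d->e[i].name, name) == 0)
			return &d->e[i];
	return NULL;
}

/*
 * Mark @top opaque, then add a fallthru entry to @top for every name
 * in @lower that @top does not already have.  Returns the number of
 * fallthru entries created, or -1 if @top is full.
 */
static int model_copyup(struct model_dir *top, const struct model_dir *lower)
{
	int created = 0;

	top->opaque = 1;
	for (size_t i = 0; i < lower->n; i++) {
		/* Any existing entry, whiteout included, hides the lower one */
		if (model_find(top, lower->e[i].name))
			continue;
		if (top->n == MODEL_MAX)
			return -1;
		top->e[top->n].name = lower->e[i].name;
		top->e[top->n].kind = MODEL_FALLTHRU;
		top->n++;
		created++;
	}
	return created;
}
```

After the copy-up, a readdir() of the topmost directory alone would see every visible name, which is why the patch can then mark the directory opaque.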

2010-04-15 23:12:15

by Valerie Aurora

Subject: [PATCH 20/35] union-mount: Implement union lookup

Implement unioned directories, whiteouts, and fallthrus in pathname
lookup routines. do_lookup() and lookup_hash() call lookup_union()
after looking up the dentry from the top-level file system.
lookup_union() is centered around __lookup_hash(), which does cached
and/or real lookups and revalidates each dentry in the union stack.

The added cost to a non-union mount pathname lookup in a
CONFIG_UNION_MOUNT kernel is either one or two mount flag tests per
pathname component, in needs_union_lookup().
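
The order of the short-circuit tests can be modeled in userspace. This is only a sketch: model_mnt, model_dentry and model_needs_union_lookup() are hypothetical stand-ins for the kernel's vfsmount, dentry and needs_union_lookup(), and the booleans replace IS_MNT_UNION(), IS_ROOT() and S_ISDIR(). On a non-union mount the function returns after the first or second flag test, which is the cost quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for struct vfsmount and struct dentry. */
struct model_mnt {
	bool is_union;		/* models IS_MNT_UNION(mnt) */
};

struct model_dentry {
	bool is_root;		/* models IS_ROOT(dentry) */
	bool positive;		/* models dentry->d_inode != NULL */
	bool is_dir;		/* models S_ISDIR(inode->i_mode) */
};

static int model_needs_union_lookup(const struct model_mnt *parent_mnt,
				    const struct model_mnt *mnt,
				    const struct model_dentry *dentry)
{
	/* Root of a non-union mount: the parent mount does not matter */
	if (dentry->is_root && !mnt->is_union)
		return 0;

	/* The child may come from a lower layer, so test the parent mount */
	if (!parent_mnt->is_union)
		return 0;

	/* Only directories can be unioned */
	if (dentry->positive && !dentry->is_dir)
		return 0;

	/* Directories and negative dentries need the full union lookup */
	return 1;
}
```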

XXX - implement negative union cache entries
---
fs/namei.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/union.c | 67 +++++++++++++++++
include/linux/union.h | 9 +++
3 files changed, 266 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 991767b..b179062 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -32,6 +32,7 @@
#include <linux/fcntl.h>
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
+#include <linux/union.h>
#include <asm/uaccess.h>

#include "internal.h"
@@ -722,6 +723,181 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
follow_mount(&nd->path);
}

+static struct dentry *__lookup_hash(struct qstr *name, struct dentry *base,
+ struct nameidata *nd);
+
+/*
+ * __lookup_union - Given a path from the topmost layer, lookup and
+ * revalidate each dentry in its union stack, building it if necessary
+ *
+ * @nd - nameidata for the parent of @topmost
+ * @name - pathname from this element on
+ * @topmost - path of the topmost matching dentry
+ *
+ * Given the nameidata and the path of the topmost dentry for this
+ * pathname, lookup, revalidate, and build the associated union stack.
+ * @topmost must be either a negative dentry or a directory.
+ *
+ * This function is called both to build a new union stack and to
+ * revalidate a pre-existing union stack. So we must cope with
+ * already existing union cache entries.
+ *
+ * This function may overwrite nd->path with the path of the parent
+ * directory in a lower layer, so the caller must save nd->path and
+ * restore it afterwards. You probably want to use lookup_union(),
+ * not __lookup_union().
+ */
+
+static int __lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path parent = nd->path;
+ struct dentry *dentry;
+ struct path upper;
+ struct path lower;
+ int err = 0;
+
+ if (d_is_whiteout(topmost->dentry))
+ return 0;
+
+ if (IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(topmost->dentry))
+ return 0;
+
+ /* upper is the most recent positive dentry or topmost negative */
+ upper.dentry = dget(topmost->dentry);
+ upper.mnt = mntget(topmost->mnt);
+
+ /* union_down_one() drops a reference, take one */
+ path_get(&nd->path);
+
+ /* Traverse the parent dir's union stack looking for this name */
+ while (union_down_one(&nd->path.mnt, &nd->path.dentry)) {
+ /* Lookup and revalidate the child dentry */
+ lower.mnt = nd->path.mnt;
+ lower.dentry = __lookup_hash(name, nd->path.dentry, nd);
+
+ if (IS_ERR(lower.dentry)) {
+ err = PTR_ERR(lower.dentry);
+ break;
+ }
+
+ if (d_is_whiteout(lower.dentry)) {
+ dput(lower.dentry);
+ break;
+ }
+
+ if (IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(lower.dentry))
+ break;
+
+ if (!lower.dentry->d_inode) {
+ dput(lower.dentry);
+ continue;
+ }
+
+ /*
+ * You can't union a file with a directory! Note that
+ * if the topmost directory entry is positive, then it
+ * will be a directory at this point.
+ */
+ if (topmost->dentry->d_inode &&
+ !S_ISDIR(lower.dentry->d_inode->i_mode)) {
+ dput(lower.dentry);
+ break;
+ }
+
+ /* Non-dir entries block anything below, so bail out */
+ if (!S_ISDIR(lower.dentry->d_inode->i_mode)) {
+ dput(topmost->dentry);
+ topmost->dentry = lower.dentry;
+ /* mntput() of previous topmost done in link_path_walk() */
+ topmost->mnt = mntget(lower.mnt);
+ break;
+ }
+
+ /* The topmost directory must always exist. Create if necessary. */
+ if (!topmost->dentry->d_inode) {
+ dentry = union_create_topmost_dir(&parent, name, &lower);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ dput(lower.dentry);
+ break;
+ }
+ dput(topmost->dentry);
+ topmost->dentry = dentry;
+ dput(upper.dentry);
+ upper.dentry = dget(dentry);
+ }
+
+ /* Add new dentry to the union stack (handles already-created case) */
+ err = append_to_union(&upper, &lower);
+ if (err) {
+ dput(lower.dentry);
+ break;
+ }
+
+ path_put(&upper);
+ upper.mnt = mntget(lower.mnt);
+ upper.dentry = lower.dentry;
+ }
+ path_put(&nd->path);
+ path_put(&upper);
+
+ return err;
+}
+
+/*
+ * lookup_union - revalidate and build union stack for this path
+ *
+ * We borrow the nameidata struct from the topmost layer to do the
+ * revalidation on lower dentries, replacing the topmost parent
+ * directory's path with that of the matching parent dir in each lower
+ * layer. This wrapper for __lookup_union() saves the topmost layer's
+ * path and restores it when we are done.
+ */
+static int lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path saved_path;
+ int err;
+
+ BUG_ON(!IS_MNT_UNION(nd->path.mnt) && !IS_MNT_UNION(topmost->mnt));
+ BUG_ON(!mutex_is_locked(&nd->path.dentry->d_inode->i_mutex));
+
+ saved_path = nd->path;
+ path_get(&saved_path);
+
+ err = __lookup_union(nd, name, topmost);
+
+ nd->path = saved_path;
+ path_put(&saved_path);
+
+ return err;
+}
+
+/*
+ * do_union_lookup - union mount-aware part of do_lookup
+ *
+ * do_lookup()-style wrapper for lookup_union(). Follows mounts.
+ */
+
+static int do_union_lookup(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct dentry *parent = nd->path.dentry;
+ struct inode *dir = parent->d_inode;
+ int err;
+
+ mutex_lock(&dir->i_mutex);
+ err = lookup_union(nd, name, topmost);
+ mutex_unlock(&dir->i_mutex);
+
+ __follow_mount(topmost);
+
+ return err;
+}
+
/*
* It's more convoluted than I'd like it to be, but... it's still fairly
* small and for now I'd prefer to have fast path as straight as possible.
@@ -752,6 +928,11 @@ done:
path->mnt = mnt;
path->dentry = dentry;
__follow_mount(path);
+ if (needs_union_lookup(nd->path.mnt, path)) {
+ int err = do_union_lookup(nd, name, path);
+ if (err < 0)
+ return err;
+ }
return 0;

need_lookup:
@@ -1223,8 +1404,13 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
err = PTR_ERR(path->dentry);
path->dentry = NULL;
path->mnt = NULL;
+ return err;
}
+
+ if (needs_union_lookup(nd->path.mnt, path))
+ err = lookup_union(nd, name, path);
return err;
+
}

static int __lookup_one_len(const char *name, struct qstr *this,
@@ -2945,7 +3131,10 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EXDEV;
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
-
+ /* Rename on union mounts not implemented yet */
+ /* XXX much harsher check than necessary - can do some renames */
+ if (IS_UNIONED_DIR(&oldnd.path) || IS_UNIONED_DIR(&newnd.path))
+ goto exit2;
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
diff --git a/fs/union.c b/fs/union.c
index 4168b62..5011d26 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -22,6 +22,7 @@
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/union.h>
+#include <linux/namei.h>

/*
* This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -196,6 +197,43 @@ static struct union_mount *union_cache_rlookup(struct dentry *dentry, struct vfs
}

/*
+ * needs_union_lookup - Does this path need a union lookup?
+ *
+ * @parent_mnt - parent mnt, usually from associated nameidata (nd->path.mnt)
+ * @path - path of potential child union directory
+ *
+ * Short-circuit union operations on paths that can't possibly be
+ * unioned directories or don't need union lookup.
+ */
+
+int needs_union_lookup(struct vfsmount *parent_mnt, struct path *path)
+{
+ /* If this is the root of a mount, ignore the parent */
+ if (IS_ROOT(path->dentry) && !IS_MNT_UNION(path->mnt))
+ return 0;
+
+ /* The child could be from a lower layer, check the parent mnt */
+ if (!IS_MNT_UNION(parent_mnt))
+ return 0;
+
+ /* Only directories can be unioned */
+ if (path->dentry->d_inode &&
+ !S_ISDIR(path->dentry->d_inode->i_mode))
+ return 0;
+
+ /*
+ * XXX - A negative dentry for a directory in a unioned
+ * directory could have a matching directory below it. Or it
+ * could not. Either way, all we have is a negative dentry.
+ * As a result, negative dentries with unioned parents always
+ * have to go through a full union lookup. This can be
+ * avoided by adding a negative union cache entry for the
+ * negative dentry.
+ */
+ return 1;
+}
+
+/*
* append_to_union - add a path to the bottom of the union stack
*
* Allocate and attach a union cache entry linking the new, upper
@@ -343,3 +381,33 @@ repeat:
}
spin_unlock(&union_lock);
}
+
+/*
+ * union_create_topmost_dir - Create a matching dir in the topmost file system
+ */
+
+struct dentry * union_create_topmost_dir(struct path *parent, struct qstr *name,
+ struct path *lower)
+{
+ struct dentry *dentry;
+ int mode = lower->dentry->d_inode->i_mode;
+ int res;
+
+ res = mnt_want_write(parent->mnt);
+ if (res)
+ return ERR_PTR(res);
+
+ dentry = lookup_one_len(name->name, parent->dentry, name->len);
+ if (IS_ERR(dentry))
+ goto out;
+
+ res = vfs_mkdir(parent->dentry->d_inode, dentry, mode);
+ if (res) {
+ dput(dentry);
+ dentry = ERR_PTR(res);
+ goto out;
+ }
+out:
+ mnt_drop_write(parent->mnt);
+ return dentry;
+}
diff --git a/include/linux/union.h b/include/linux/union.h
index c5ead54..681b472 100644
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -38,19 +38,28 @@ struct union_mount {
};

#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+#define IS_UNIONED_DIR(path) (IS_MNT_UNION((path)->mnt) && \
+ ((path)->dentry->d_unionized || \
+ !list_empty(&(path)->dentry->d_unions)))

+extern int needs_union_lookup(struct vfsmount *, struct path *);
extern int append_to_union(struct path *, struct path*);
extern int union_down_one(struct vfsmount **, struct dentry **);
extern void __d_drop_unions(struct dentry *);
extern void shrink_d_unions(struct dentry *);
+extern struct dentry * union_create_topmost_dir(struct path *, struct qstr *,
+ struct path *);

#else /* CONFIG_UNION_MOUNT */

#define IS_MNT_UNION(x) (0)
+#define IS_UNIONED_DIR(x) (0)
+#define needs_union_lookup(x, y) ({ (0); })
#define append_to_union(x, y) ({ BUG(); (0); })
#define union_down_one(x, y) ({ (0); })
#define __d_drop_unions(x) do { } while (0)
#define shrink_d_unions(x) do { } while (0)
+#define union_create_topmost_dir(x, y, z) ({ BUG(); (NULL); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-04-15 23:06:35

by Valerie Aurora

Subject: [PATCH 11/35] whiteout: jffs2 whiteout support

From: Felix Fietkau <[email protected]>

Add support for whiteout dentries to jffs2.
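
Before whiting out a directory, the patch checks that the victim is really empty by walking its in-memory dirent list and rejecting any entry with a non-zero inode number. A minimal userspace sketch of that test follows; model_dirent and model_dir_is_empty() are hypothetical names, and the list is a plain singly linked list rather than jffs2's struct jffs2_full_dirent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct jffs2_full_dirent. */
struct model_dirent {
	uint32_t ino;			/* 0 for deletion dirents */
	const struct model_dirent *next;
};

/*
 * A directory counts as empty when no dirent on its list refers to a
 * live inode.  Entries with ino == 0 do not make it non-empty.
 */
static bool model_dir_is_empty(const struct model_dirent *head)
{
	for (const struct model_dirent *fd = head; fd; fd = fd->next)
		if (fd->ino)
			return false;
	return true;
}
```

In the patch, a non-empty result becomes -ENOTEMPTY and the whiteout is not created.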

Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
---
fs/jffs2/dir.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/jffs2/fs.c | 4 +++
fs/jffs2/super.c | 2 +-
include/linux/jffs2.h | 2 +
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 7aa4417..c259193 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

+static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+
const struct file_operations jffs2_dir_operations =
{
.read = generic_read_dir,
@@ -56,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.mknod = jffs2_mknod,
.rename = jffs2_rename,
.check_acl = jffs2_check_acl,
+ .whiteout = jffs2_whiteout,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -98,8 +101,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
fd = fd_list;
}
}
- if (fd)
- ino = fd->ino;
+ if (fd) {
+ spin_lock(&target->d_lock);
+ if (fd->type == DT_WHT)
+ target->d_flags |= DCACHE_WHITEOUT;
+ else
+ ino = fd->ino;
+ spin_unlock(&target->d_lock);
+ }
mutex_unlock(&dir_f->sem);
if (ino) {
inode = jffs2_iget(dir_i->i_sb, ino);
@@ -498,6 +507,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, int mode)
return PTR_ERR(inode);
}

+ if (dentry->d_flags & DCACHE_WHITEOUT) {
+ inode->i_flags |= S_OPAQUE;
+ ri->flags = cpu_to_je16(JFFS2_INO_FLAG_OPAQUE);
+ }
+
inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;

@@ -779,6 +793,60 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return 0;
}

+static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ struct jffs2_inode_info *victim_f = NULL;
+ uint32_t now;
+ int ret;
+
+ /* If it's a directory, then check whether it is really empty */
+ if (new_dentry->d_inode) {
+ victim_f = JFFS2_INODE_INFO(old_dentry->d_inode);
+ if (S_ISDIR(old_dentry->d_inode->i_mode)) {
+ struct jffs2_full_dirent *fd;
+
+ mutex_lock(&victim_f->sem);
+ for (fd = victim_f->dents; fd; fd = fd->next) {
+ if (fd->ino) {
+ mutex_unlock(&victim_f->sem);
+ return -ENOTEMPTY;
+ }
+ }
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_WHT,
+ new_dentry->d_name.name, new_dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (victim_f) {
+ /* There was a victim. Kill it off nicely */
+ drop_nlink(old_dentry->d_inode);
+ /* Don't oops if the victim was a dirent pointing to an
+ inode which didn't exist. */
+ if (victim_f->inocache) {
+ mutex_lock(&victim_f->sem);
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ victim_f->inocache->pino_nlink = 0;
+ else
+ victim_f->inocache->pino_nlink--;
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ return 0;
+}
+
static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
struct inode *new_dir_i, struct dentry *new_dentry)
{
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 3451a81..c1e333c 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -301,6 +301,10 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned long ino)

inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;
+
+ if (je16_to_cpu(latest_node.flags) & JFFS2_INO_FLAG_OPAQUE)
+ inode->i_flags |= S_OPAQUE;
+
break;
}
case S_IFREG:
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 9a80e8e..c12cd1c 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -172,7 +172,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)

sb->s_op = &jffs2_super_operations;
sb->s_export_op = &jffs2_export_ops;
- sb->s_flags = sb->s_flags | MS_NOATIME;
+ sb->s_flags = sb->s_flags | MS_NOATIME | MS_WHITEOUT;
sb->s_xattr = jffs2_xattr_handlers;
#ifdef CONFIG_JFFS2_FS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 2b32d63..65533bb 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -87,6 +87,8 @@
#define JFFS2_INO_FLAG_USERCOMPR 2 /* User has requested a specific
compression type */

+#define JFFS2_INO_FLAG_OPAQUE 4 /* Directory is opaque (for union mounts) */
+

/* These can go once we've made sure we've caught all uses without
byteswapping */
--
1.6.3.3

2010-04-15 23:06:51

by Valerie Aurora

Subject: [PATCH 14/35] fallthru: jffs2 fallthru support

From: Felix Fietkau <[email protected]>

Add support for fallthru dentries to jffs2.
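
The lookup path has to sort each directory entry into one of three classes: whiteout, fallthru, or ordinary entry with an inode number. The sketch below models that classification in userspace, assuming the three cases are meant to be mutually exclusive, so each case ends with a break. MODEL_DT_WHT mirrors DT_WHT (14) and MODEL_DT_FALLTHRU mirrors JFFS2_DT_FALLTHRU (DT_WHT + 1); model_result and model_classify() are hypothetical names.

```c
/* Hypothetical userspace mirror of the dirent types and dcache flags. */
#define MODEL_DT_WHT		14
#define MODEL_DT_FALLTHRU	(MODEL_DT_WHT + 1)

#define MODEL_WHITEOUT		0x0080u
#define MODEL_FALLTHRU		0x0100u

struct model_result {
	unsigned int flags;	/* flags that would be set on the dentry */
	unsigned int ino;	/* inode number to look up, 0 for none */
};

static struct model_result model_classify(unsigned char type, unsigned int ino)
{
	struct model_result r = { 0, 0 };

	switch (type) {
	case MODEL_DT_WHT:
		/* Whiteout: negative dentry, hides anything below */
		r.flags |= MODEL_WHITEOUT;
		break;
	case MODEL_DT_FALLTHRU:
		/* Fallthru: negative dentry, keep looking below */
		r.flags |= MODEL_FALLTHRU;
		break;
	default:
		/* Ordinary entry: use its inode number */
		r.ino = ino;
	}
	return r;
}
```

Without the break statements a whiteout would also pick up the fallthru flag and an inode number, so the three classes would no longer be distinguishable.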

Cc: David Woodhouse <[email protected]>
Cc: [email protected]
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/jffs2/dir.c | 36 +++++++++++++++++++++++++++++++++---
include/linux/jffs2.h | 6 ++++++
2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index c259193..98397b3 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -35,6 +35,7 @@ static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+static int jffs2_fallthru (struct inode *, struct dentry *);

const struct file_operations jffs2_dir_operations =
{
@@ -59,6 +60,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rename = jffs2_rename,
.check_acl = jffs2_check_acl,
.whiteout = jffs2_whiteout,
+ .fallthru = jffs2_fallthru,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -103,10 +105,16 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
}
if (fd) {
spin_lock(&target->d_lock);
- if (fd->type == DT_WHT)
+ switch (fd->type) {
+ case DT_WHT:
target->d_flags |= DCACHE_WHITEOUT;
- else
+ break;
+ case JFFS2_DT_FALLTHRU:
+ target->d_flags |= DCACHE_FALLTHRU;
+ break;
+ default:
ino = fd->ino;
+ }
spin_unlock(&target->d_lock);
}
mutex_unlock(&dir_f->sem);
@@ -164,7 +170,10 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
fd->name, fd->ino, fd->type, curofs, offset));
continue;
}
- if (!fd->ino) {
+ if (fd->type == JFFS2_DT_FALLTHRU)
+ /* XXX Should really do a lookup for the real inode number here */
+ fd->ino = 100;
+ else if (!fd->ino && (fd->type != DT_WHT)) {
D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
offset++;
continue;
@@ -793,6 +802,26 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return 0;
}

+static int jffs2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ uint32_t now;
+ int ret;
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_UNKNOWN,
+ dentry->d_name.name, dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+
+ return 0;
+}
+
static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
{
@@ -825,6 +854,7 @@ static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
return ret;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 65533bb..dbe8c93 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -114,6 +114,12 @@ struct jffs2_unknown_node
jint32_t hdr_crc;
};

+/*
+ * Non-standard directory entry type(s), for on-disk use
+ */
+
+#define JFFS2_DT_FALLTHRU (DT_WHT + 1)
+
struct jffs2_raw_dirent
{
jint16_t magic;
--
1.6.3.3

2010-04-15 23:12:56

by Valerie Aurora

Subject: [PATCH 13/35] fallthru: ext2 fallthru support

Add support for fallthru directory entries to ext2.

XXX - Makes up inode number for fallthru entry
XXX - Might be better implemented as special symlinks
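
The core of the change is the name-matching rule: an entry with inode == 0 is normally a free slot, but it must still match by name when its file type marks it as a whiteout or a fallthru. A userspace sketch of that predicate follows. model_dirent and model_match() are hypothetical names, and the struct is a simplified stand-in for the on-disk ext2 directory entry (no rec_len, fixed-size name).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values mirror EXT2_FT_WHT and EXT2_FT_FALLTHRU from the patch. */
enum { MODEL_FT_WHT = 8, MODEL_FT_FALLTHRU = 9 };

/* Simplified, hypothetical stand-in for struct ext2_dir_entry_2. */
struct model_dirent {
	uint32_t inode;
	uint8_t name_len;
	uint8_t file_type;
	char name[255];
};

/* Return 1 if @de is a usable entry named @name, 0 otherwise. */
static int model_match(size_t len, const char *name,
		       const struct model_dirent *de)
{
	if (len != de->name_len)
		return 0;
	/* inode == 0 is a free slot unless it is a whiteout or fallthru */
	if (!de->inode &&
	    de->file_type != MODEL_FT_WHT &&
	    de->file_type != MODEL_FT_FALLTHRU)
		return 0;
	return memcmp(name, de->name, len) == 0;
}
```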

Cc: Theodore Tso <[email protected]>
Cc: [email protected]
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 1 +
fs/ext2/namei.c | 22 +++++++++++
include/linux/ext2_fs.h | 1 +
4 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 030bd46..f3b4aff 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode && (de->file_type != EXT2_FT_WHT))
+ if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU)))
return 0;
return !memcmp(name, de->name, len);
}
@@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
[EXT2_FT_WHT] = DT_WHT,
+ [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
};

#define S_SHIFT 12
@@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
ext2_put_page(page);
return 0;
}
+ } else if (de->file_type == EXT2_FT_FALLTHRU) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+
+ offset = (char *)de - kaddr;
+ /* XXX We don't know the inode number
+ * of the directory entry in the
+ * underlying file system. Should
+ * look it up, either on fallthru
+ * creation at first readdir or now at
+ * filldir time. */
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ 123 /* Made up ino */, d_type);
+ if (over) {
+ ext2_put_page(page);
+ return 0;
+ }
}
filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
@@ -463,6 +483,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&dentry->d_lock);
+ } else if(!res && de->file_type == EXT2_FT_FALLTHRU) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
}
ext2_put_page(page);
}
@@ -532,6 +556,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
+ de->file_type = 0;
goto got_it;
}
if (de->rec_len == 0) {
@@ -545,6 +570,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU) &&
(rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
@@ -587,7 +613,8 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)

err = -EEXIST;
if (ext2_match (namelen, name, de)) {
- if (de->file_type == EXT2_FT_WHT)
+ if ((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU))
goto got_it;
goto out_unlock;
}
@@ -602,7 +629,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
!ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
@@ -627,6 +655,60 @@ out_unlock:
}

/*
+ * Create a fallthru entry.
+ */
+int ext2_fallthru_entry (struct inode *dir, struct dentry *dentry)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ if (de->inode || (de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_FALLTHRU;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
+/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
*/
@@ -711,7 +793,9 @@ int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
*/
if (ext2_match (namelen, name, de))
de->inode = 0;
- if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 44d190c..2fa32b3 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -108,6 +108,7 @@ extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *,
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
extern int ext2_whiteout_entry (struct inode *, struct dentry *,
struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_fallthru_entry (struct inode *, struct dentry *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 12195a5..f28154c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -349,6 +349,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
goto out;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
@@ -367,6 +368,26 @@ out:
return err;
}

+/*
+ * Create a fallthru entry.
+ */
+static int ext2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ int err;
+
+ dquot_initialize(dir);
+
+ err = ext2_fallthru_entry(dir, dentry);
+ if (err)
+ return err;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -470,6 +491,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
.whiteout = ext2_whiteout,
+ .fallthru = ext2_fallthru,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 20468bd..cb3d400 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -577,6 +577,7 @@ enum {
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
EXT2_FT_WHT = 8,
+ EXT2_FT_FALLTHRU = 9,
EXT2_FT_MAX
};

--
1.6.3.3

2010-04-15 23:13:16

by Valerie Aurora

Subject: [PATCH 12/35] fallthru: Basic fallthru definitions

Define the fallthru dcache flag and file system op. Mask out the
DCACHE_FALLTHRU flag on dentry creation. Actual users and changes to
lookup come in later patches.
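
The flag handling can be modeled in a few lines of userspace C. This is a sketch, not the kernel code: model_dentry, model_is_fallthru() and model_instantiate() are hypothetical names, and the two constants mirror the DCACHE_WHITEOUT and DCACHE_FALLTHRU values in this patch. It shows that attaching an inode clears both flags, while instantiating a negative dentry leaves them alone.

```c
#include <stddef.h>

/* Values mirror DCACHE_WHITEOUT and DCACHE_FALLTHRU from the patch. */
#define MODEL_DCACHE_WHITEOUT	0x0080u
#define MODEL_DCACHE_FALLTHRU	0x0100u

/* Hypothetical stand-in for struct dentry. */
struct model_dentry {
	unsigned int d_flags;
	const void *d_inode;
};

static int model_is_fallthru(const struct model_dentry *d)
{
	return (d->d_flags & MODEL_DCACHE_FALLTHRU) != 0;
}

/* Models __d_instantiate(): a positive dentry is never a whiteout or fallthru. */
static void model_instantiate(struct model_dentry *d, const void *inode)
{
	if (inode)
		d->d_flags &= ~(MODEL_DCACHE_WHITEOUT | MODEL_DCACHE_FALLTHRU);
	d->d_inode = inode;
}
```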

Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/vfs.txt | 6 ++++++
fs/dcache.c | 2 +-
include/linux/dcache.h | 6 ++++++
include/linux/fs.h | 1 +
4 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8846b4f..29f3476 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -320,6 +320,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -390,6 +391,11 @@ otherwise noted.
second is the dentry for the whiteout itself. This method
must unlink() or rmdir() the original entry if it exists.

+ fallthru: called by the readdir(2) system call on a layered file
+ system. Only required if you want to support fallthrus.
+ Fallthrus are place-holders for directory entries visible from
+ a lower level file system.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.

diff --git a/fs/dcache.c b/fs/dcache.c
index 3b0e525..b76f9e4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -993,7 +993,7 @@ EXPORT_SYMBOL(d_alloc_name);
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
if (inode) {
- dentry->d_flags &= ~DCACHE_WHITEOUT;
+ dentry->d_flags &= ~(DCACHE_WHITEOUT|DCACHE_FALLTHRU);
list_add(&dentry->d_alias, &inode->i_dentry);
}
dentry->d_inode = inode;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7648b49..e035c51 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -184,6 +184,7 @@ d_iput: no no no yes

#define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
#define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */
+#define DCACHE_FALLTHRU 0x0100 /* Keep looking in the file system below */

#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */

@@ -364,6 +365,11 @@ static inline int d_is_whiteout(struct dentry *dentry)
return (dentry->d_flags & DCACHE_WHITEOUT);
}

+static inline int d_is_fallthru(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_FALLTHRU);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a9f747c..a5ba718 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1532,6 +1532,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3
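
As a rough illustration of the dcache side of the fallthru patch above, the following userspace sketch models the two marker flags and the patched __d_instantiate(). The toy_* structs and functions are hypothetical stand-ins rather than kernel types; only the two flag values are taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in the patch (include/linux/dcache.h). */
#define DCACHE_WHITEOUT 0x0080 /* this negative dentry is a whiteout */
#define DCACHE_FALLTHRU 0x0100 /* keep looking in the file system below */

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct toy_inode {
	unsigned long i_ino;
};

struct toy_dentry {
	unsigned int d_flags;
	struct toy_inode *d_inode; /* NULL for a negative dentry */
};

static int toy_is_whiteout(const struct toy_dentry *d)
{
	return (d->d_flags & DCACHE_WHITEOUT) != 0;
}

static int toy_is_fallthru(const struct toy_dentry *d)
{
	return (d->d_flags & DCACHE_FALLTHRU) != 0;
}

/*
 * Mirrors the patched __d_instantiate(): attaching a real inode
 * clears both marker flags, attaching NULL leaves them alone.
 */
static void toy_instantiate(struct toy_dentry *d, struct toy_inode *inode)
{
	if (inode)
		d->d_flags &= ~(DCACHE_WHITEOUT | DCACHE_FALLTHRU);
	d->d_inode = inode;

	/* A positive dentry must not carry either marker. */
	assert(!d->d_inode || !(toy_is_whiteout(d) || toy_is_fallthru(d)));
}
```

The point of clearing both flags in one place is that a name which was a whiteout or a fallthru stops being one as soon as a real object is created under it.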

2010-04-15 23:13:34

by Valerie Aurora

Subject: [PATCH 10/35] whiteout: ext2 whiteout support

From: Jan Blunck <[email protected]>

This patch adds whiteout support to ext2. A whiteout is an empty directory
entry (inode == 0) with the file type set to EXT2_FT_WHT, so it still
occupies space in the directory. Because whiteouts are implemented as a file
type, the EXT2_FEATURE_INCOMPAT_FILETYPE flag must be set.

XXX - Whiteouts could be implemented as special symbolic links

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 96 +++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 3 +
fs/ext2/inode.c | 11 ++++-
fs/ext2/namei.c | 67 +++++++++++++++++++++++++++++++-
fs/ext2/super.c | 6 +++
include/linux/ext2_fs.h | 4 ++
6 files changed, 177 insertions(+), 10 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 57207a9..030bd46 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT))
return 0;
return !memcmp(name, de->name, len);
}
@@ -255,6 +255,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_FIFO] = DT_FIFO,
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
+ [EXT2_FT_WHT] = DT_WHT,
};

#define S_SHIFT 12
@@ -448,6 +449,26 @@ ino_t ext2_inode_by_name(struct inode *dir, struct qstr *child)
return res;
}

+/* Special version for filetype based whiteout support */
+ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct ext2_dir_entry_2 *de;
+ struct page *page;
+
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (de) {
+ res = le32_to_cpu(de->inode);
+ if (!res && de->file_type == EXT2_FT_WHT) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&dentry->d_lock);
+ }
+ ext2_put_page(page);
+ }
+ return res;
+}
+
/* Releases the page */
void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode, int update_times)
@@ -523,7 +544,8 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
- if (!de->inode && rec_len >= reclen)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
goto got_it;
@@ -564,8 +586,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
return PTR_ERR(de);

err = -EEXIST;
- if (ext2_match (namelen, name, de))
+ if (ext2_match (namelen, name, de)) {
+ if (de->file_type == EXT2_FT_WHT)
+ goto got_it;
goto out_unlock;
+ }

got_it:
name_len = EXT2_DIR_REC_LEN(de->name_len);
@@ -577,7 +602,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode) {
+ if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
@@ -646,6 +672,68 @@ out:
return err;
}

+int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
+ struct ext2_dir_entry_2 * de, struct page * page)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ loff_t pos;
+ int err;
+
+ if (!de) {
+ de = ext2_append_entry(dentry, &page);
+ BUG_ON(!de);
+ }
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de) &&
+ (de->file_type == EXT2_FT_WHT)) {
+ ext2_error(dir->i_sb, __func__,
+ "entry is already a whiteout in directory #%lu",
+ dir->i_ino);
+ goto out_unlock;
+ }
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ /*
+ * We whiteout an existing entry. Do what ext2_delete_entry() would do,
+ * except that we don't need to merge with the previous entry since
+ * we are going to reuse it.
+ */
+ if (ext2_match (namelen, name, de))
+ de->inode = 0;
+ if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_WHT;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
/*
* Set the first fragment of directory.
*/
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 0b038e4..44d190c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -102,9 +102,12 @@ extern void ext2_rsv_window_add(struct super_block *sb, struct ext2_reserve_wind
/* dir.c */
extern int ext2_add_link (struct dentry *, struct inode *);
extern ino_t ext2_inode_by_name(struct inode *, struct qstr *);
+extern ino_t ext2_inode_by_dentry(struct inode *, struct dentry *);
extern int ext2_make_empty(struct inode *, struct inode *);
extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *, struct page **);
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_whiteout_entry (struct inode *, struct dentry *,
+ struct ext2_dir_entry_2 *, struct page *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index fc13cc1..5ad2cbb 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1184,7 +1184,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|
+ S_OPAQUE);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1195,6 +1196,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT2_OPAQUE_FL)
+ inode->i_flags |= S_OPAQUE;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
@@ -1202,8 +1205,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
{
unsigned int flags = ei->vfs_inode.i_flags;

- ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|
- EXT2_IMMUTABLE_FL|EXT2_NOATIME_FL|EXT2_DIRSYNC_FL);
+ ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|EXT2_IMMUTABLE_FL|
+ EXT2_NOATIME_FL|EXT2_DIRSYNC_FL|EXT2_OPAQUE_FL);
if (flags & S_SYNC)
ei->i_flags |= EXT2_SYNC_FL;
if (flags & S_APPEND)
@@ -1214,6 +1217,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
ei->i_flags |= EXT2_NOATIME_FL;
if (flags & S_DIRSYNC)
ei->i_flags |= EXT2_DIRSYNC_FL;
+ if (flags & S_OPAQUE)
+ ei->i_flags |= EXT2_OPAQUE_FL;
}

struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..12195a5 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -55,15 +55,16 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
* Methods themselves.
*/

-static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
+static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry,
+ struct nameidata *nd)
{
struct inode * inode;
ino_t ino;
-
+
if (dentry->d_name.len > EXT2_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);

- ino = ext2_inode_by_name(dir, &dentry->d_name);
+ ino = ext2_inode_by_dentry(dir, dentry);
inode = NULL;
if (ino) {
inode = ext2_iget(dir->i_sb, ino);
@@ -242,6 +243,10 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
else
inode->i_mapping->a_ops = &ext2_aops;

+ /* if we call mkdir on a whiteout create an opaque directory */
+ if (dentry->d_flags & DCACHE_WHITEOUT)
+ inode->i_flags |= S_OPAQUE;
+
inode_inc_link_count(inode);

err = ext2_make_empty(inode, dir);
@@ -307,6 +312,61 @@ static int ext2_rmdir (struct inode * dir, struct dentry *dentry)
return err;
}

+/*
+ * Create a whiteout for the dentry
+ */
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
+ struct dentry *new_dentry)
+{
+ struct inode * inode = dentry->d_inode;
+ struct ext2_dir_entry_2 * de = NULL;
+ struct page * page;
+ int err = -ENOTEMPTY;
+
+ if (!EXT2_HAS_INCOMPAT_FEATURE(dir->i_sb,
+ EXT2_FEATURE_INCOMPAT_FILETYPE)) {
+ ext2_error (dir->i_sb, "ext2_whiteout",
+ "can't set whiteout filetype");
+ err = -EPERM;
+ goto out;
+ }
+
+ dquot_initialize(dir);
+
+ if (inode) {
+ if (S_ISDIR(inode->i_mode) && !ext2_empty_dir(inode))
+ goto out;
+
+ err = -ENOENT;
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (!de)
+ goto out;
+ lock_page(page);
+ }
+
+ err = ext2_whiteout_entry (dir, dentry, de, page);
+ if (err)
+ goto out;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (inode) {
+ inode->i_ctime = dir->i_ctime;
+ inode_dec_link_count(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ inode->i_size = 0;
+ inode_dec_link_count(inode);
+ inode_dec_link_count(dir);
+ }
+ }
+ err = 0;
+out:
+ return err;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -409,6 +469,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.mkdir = ext2_mkdir,
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
+ .whiteout = ext2_whiteout,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 42e4a30..000ee17 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1079,6 +1079,12 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
ext2_msg(sb, KERN_WARNING,
"warning: mounting ext3 filesystem as ext2");
+ /*
+ * Whiteouts (and fallthrus) require explicit whiteout support.
+ */
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+ sb->s_flags |= MS_WHITEOUT;
+
ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
return 0;

diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 2dfa707..20468bd 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -189,6 +189,7 @@ struct ext2_group_desc
#define EXT2_NOTAIL_FL FS_NOTAIL_FL /* file tail should not be merged */
#define EXT2_DIRSYNC_FL FS_DIRSYNC_FL /* dirsync behaviour (directories only) */
#define EXT2_TOPDIR_FL FS_TOPDIR_FL /* Top of directory hierarchies*/
+#define EXT2_OPAQUE_FL 0x00040000
#define EXT2_RESERVED_FL FS_RESERVED_FL /* reserved for ext2 lib */

#define EXT2_FL_USER_VISIBLE FS_FL_USER_VISIBLE /* User visible flags */
@@ -503,10 +504,12 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff

#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT2_FEATURE_INCOMPAT_WHITEOUT| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -573,6 +576,7 @@ enum {
EXT2_FT_FIFO = 5,
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
+ EXT2_FT_WHT = 8,
EXT2_FT_MAX
};

--
1.6.3.3
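
The directory-entry representation used by the ext2 whiteout patch above can be sketched in userspace roughly as follows. This is a simplified model, not the ext2 code: struct toy_dirent only imitates the field order of ext2_dir_entry_2, rec_len is carried but unused, and toy_match()/toy_lookup() are hypothetical names that follow the logic of the patched ext2_match() and ext2_inode_by_dentry().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* TOY_FT_WHT uses the value the patch assigns to EXT2_FT_WHT. */
enum { TOY_FT_UNKNOWN = 0, TOY_FT_REG_FILE = 1, TOY_FT_WHT = 8 };

struct toy_dirent {
	uint32_t inode;     /* 0 for unused slots and for whiteouts */
	uint16_t rec_len;   /* unused in this sketch */
	uint8_t  name_len;
	uint8_t  file_type;
	char     name[255];
};

/*
 * Same decision as the patched ext2_match(): an entry with
 * inode == 0 only matches a name if it is a whiteout.
 */
static int toy_match(size_t len, const char *name,
		     const struct toy_dirent *de)
{
	if (len != de->name_len)
		return 0;
	if (!de->inode && de->file_type != TOY_FT_WHT)
		return 0;
	return memcmp(name, de->name, len) == 0;
}

enum toy_result { TOY_NEGATIVE, TOY_POSITIVE, TOY_WHITEOUT };

/*
 * Scan a directory and classify the name. A match with inode == 0
 * can only be a whiteout, which is where the kernel code would set
 * DCACHE_WHITEOUT on the dentry.
 */
static enum toy_result toy_lookup(const struct toy_dirent *dir, size_t n,
				  const char *name, uint32_t *ino)
{
	size_t len = strlen(name);

	assert(ino != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!toy_match(len, name, &dir[i]))
			continue;
		if (dir[i].inode) {
			*ino = dir[i].inode;
			return TOY_POSITIVE;
		}
		return TOY_WHITEOUT;
	}
	return TOY_NEGATIVE;
}
```

Note that a stale slot (inode == 0 with an ordinary file type) and a whiteout differ only in file_type, which is why the patch requires the file-type feature to be enabled.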

2010-04-15 23:05:54

by Valerie Aurora

Subject: [PATCH 05/35] whiteout: Add vfs_whiteout() and whiteout inode operation

From: Jan Blunck <[email protected]>

Whiteout a given directory entry. File systems that support whiteouts
must implement the new ->whiteout() directory inode operation.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/vfs.txt | 10 +++-
fs/dcache.c | 4 +-
fs/namei.c | 133 +++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 6 ++
include/linux/fs.h | 2 +
5 files changed, 153 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3de2f32..8846b4f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -308,7 +308,7 @@ struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. As of kernel 2.6.33, the following members are defined:

struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
@@ -319,6 +319,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -382,6 +383,13 @@ otherwise noted.
will probably need to call d_instantiate() just as you would
in the create() method

+ whiteout: called by the rmdir(2) and unlink(2) system calls on a
+ layered file system. Only required if you want to support
+ whiteouts. The first dentry passed in is that for the old
+ dentry if it exists, and a negative dentry otherwise. The
+ second is the dentry for the whiteout itself. This method
+ must unlink() or rmdir() the original entry if it exists.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.

diff --git a/fs/dcache.c b/fs/dcache.c
index f1358e5..265015d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -992,8 +992,10 @@ EXPORT_SYMBOL(d_alloc_name);
/* the caller must hold dcache_lock */
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
- if (inode)
+ if (inode) {
+ dentry->d_flags &= ~DCACHE_WHITEOUT;
list_add(&dentry->d_alias, &inode->i_dentry);
+ }
dentry->d_inode = inode;
fsnotify_d_instantiate(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 8e4c75f..010927b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2165,6 +2165,139 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
}

/*
+ * Checks on the victim for whiteout. We must both be able to delete
+ * the victim directory entry (if it exists) and create a new
+ * directory entry, so this function is a combination of the checks
+ * from may_create() and may_delete().
+ */
+static inline int may_whiteout(struct inode *dir, struct dentry *victim,
+ int isdir)
+{
+ int err;
+
+ /*
+ * From may_create(). We don't have to do this check for a
+ * simple delete because the directory must exist if we are
+ * trying to delete something from it. For a whiteout, the
+ * dir may be empty and thus potentially unlinked by this point.
+ */
+ if (IS_DEADDIR(dir))
+ return -ENOENT;
+ err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+ if (err)
+ return err;
+
+ /* From may_delete(). */
+ if (IS_APPEND(dir))
+ return -EPERM;
+ if (!victim->d_inode)
+ return 0;
+ if (check_sticky(dir, victim->d_inode) ||
+ IS_APPEND(victim->d_inode) ||
+ IS_IMMUTABLE(victim->d_inode))
+ return -EPERM;
+ if (isdir) {
+ if (!S_ISDIR(victim->d_inode->i_mode))
+ return -ENOTDIR;
+ if (IS_ROOT(victim))
+ return -EBUSY;
+ } else if (S_ISDIR(victim->d_inode->i_mode))
+ return -EISDIR;
+ if (victim->d_flags & DCACHE_NFSFS_RENAMED)
+ return -EBUSY;
+ return 0;
+}
+
+/**
+ * vfs_whiteout: create a whiteout for the given directory entry
+ * @dir: parent inode
+ * @dentry: directory entry to whiteout
+ *
+ * Create a whiteout for the given directory entry. A whiteout
+ * prevents lookup from dropping down to a lower layer of a union
+ * mounted file system.
+ *
+ * There are two important cases: (a) The directory entry to be
+ * whited-out may already exist, in which case it must first be
+ * deleted before we create the whiteout, and (b) no such directory
+ * entry exists and we only have to create the whiteout itself.
+ *
+ * The caller must pass in a dentry for the directory entry to be
+ * whited-out - a positive one if it exists, and a negative if not.
+ * When this function returns, the caller should dput() the old, now
+ * defunct dentry it passed in. The dentry for the whiteout itself is
+ * created inside this function.
+ */
+static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
+{
+ int err;
+ struct inode *old_inode = old_dentry->d_inode;
+ struct dentry *parent, *whiteout;
+
+ err = may_whiteout(dir, old_dentry, isdir);
+ if (err)
+ return err;
+
+ BUG_ON(old_dentry->d_parent->d_inode != dir);
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ /*
+ * If the old dentry is positive, then we have to delete this
+ * entry before we create the whiteout. The file system
+ * ->whiteout() op does the actual delete, but we do all the
+ * VFS-level checks and changes here.
+ */
+ if (old_inode) {
+ mutex_lock(&old_inode->i_mutex);
+ if (isdir)
+ dentry_unhash(old_dentry);
+ if (d_mountpoint(old_dentry))
+ err = -EBUSY;
+ else {
+ if (isdir)
+ err = security_inode_rmdir(dir, old_dentry);
+ else
+ err = security_inode_unlink(dir, old_dentry);
+ }
+ }
+
+ parent = dget_parent(old_dentry);
+ whiteout = d_alloc_name(parent, old_dentry->d_name.name);
+
+ if (!err)
+ err = dir->i_op->whiteout(dir, old_dentry, whiteout);
+
+ if (old_inode) {
+ mutex_unlock(&old_inode->i_mutex);
+ if (!err) {
+ fsnotify_link_count(old_inode);
+ d_delete(old_dentry);
+ }
+ if (isdir)
+ dput(old_dentry);
+ }
+
+ dput(whiteout);
+ dput(parent);
+ return err;
+}
+
+int path_whiteout(struct path *dir_path, struct dentry *dentry, int isdir)
+{
+ int error = mnt_want_write(dir_path->mnt);
+
+ if (!error) {
+ error = vfs_whiteout(dir_path->dentry->d_inode, dentry, isdir);
+ mnt_drop_write(dir_path->mnt);
+ }
+
+ return error;
+}
+EXPORT_SYMBOL(path_whiteout);
+
+/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 30b93b2..7648b49 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -183,6 +183,7 @@ d_iput: no no no yes
#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched by inotify */

#define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
+#define DCACHE_WHITEOUT 0x0080 /* This negative dentry is a whiteout */

#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */

@@ -358,6 +359,11 @@ static inline int d_unlinked(struct dentry *dentry)
return d_unhashed(dentry) && !IS_ROOT(dentry);
}

+static inline int d_is_whiteout(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_WHITEOUT);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 234ebc2..21102f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -209,6 +209,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_WHITEOUT (1<<25) /* FS supports whiteout filetype */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1527,6 +1528,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3
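
The vfs_whiteout() comment in the patch above says a whiteout prevents lookup from dropping down to a lower layer. A minimal userspace sketch of that rule, assuming a simple stack of layers with layers[0] on top, might look like this. The toy_* types are hypothetical, and the model deliberately ignores opaque directories and fallthrus.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_FILE, TOY_WHITEOUT };

struct toy_entry {
	const char *name;
	enum toy_kind kind;
};

struct toy_layer {
	const struct toy_entry *entries;
	size_t count;
};

/* Find a name within one layer, or return NULL. */
static const struct toy_entry *toy_find(const struct toy_layer *layer,
					const char *name)
{
	for (size_t i = 0; i < layer->count; i++)
		if (strcmp(layer->entries[i].name, name) == 0)
			return &layer->entries[i];
	return NULL;
}

/*
 * Return the index of the layer that provides the name, or -1 if
 * the name is absent or hidden. layers[0] is the topmost layer.
 */
static int toy_union_lookup(const struct toy_layer *layers, size_t nlayers,
			    const char *name)
{
	assert(layers != NULL || nlayers == 0);
	for (size_t i = 0; i < nlayers; i++) {
		const struct toy_entry *e = toy_find(&layers[i], name);

		if (!e)
			continue;   /* nothing here: drop down one layer */
		if (e->kind == TOY_WHITEOUT)
			return -1;  /* whiteout: stop, the name is deleted */
		return (int)i;
	}
	return -1;
}
```

With a top layer holding a whiteout for "b" and a lower layer holding a file "b", the lookup reports "b" as missing even though the lower file still exists, which is the effect unlink(2) and rmdir(2) need on a union mount.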

2010-04-16 15:59:56

by J. Bruce Fields

Subject: Re: [PATCH 04/35] whiteout/NFSD: Don't return information about whiteouts to userspace

Seems OK. (Though is there any way we could avoid having to add the
check to every filldir callback? Isn't the default going to be
disinterest in whiteouts? How are we avoiding all the same checks in
the case of lookup?)

--b.

On Thu, Apr 15, 2010 at 04:04:11PM -0700, Valerie Aurora wrote:
> From: Jan Blunck <[email protected]>
>
> Userspace isn't ready for handling another file type, so silently drop
> whiteout directory entries before they leave the kernel.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: [email protected]
> Cc: "J. Bruce Fields" <[email protected]>
> Cc: Neil Brown <[email protected]>
> ---
> fs/compat.c | 9 +++++++++
> fs/nfsd/nfs3xdr.c | 5 +++++
> fs/nfsd/nfs4xdr.c | 5 +++++
> fs/nfsd/nfsxdr.c | 4 ++++
> fs/readdir.c | 9 +++++++++
> 5 files changed, 32 insertions(+), 0 deletions(-)
>
> diff --git a/fs/compat.c b/fs/compat.c
> index 00d90c2..624e1a5 100644
> --- a/fs/compat.c
> +++ b/fs/compat.c
> @@ -838,6 +838,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
> struct compat_old_linux_dirent __user *dirent;
> compat_ulong_t d_ino;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> if (buf->result)
> return -EINVAL;
> d_ino = ino;
> @@ -909,6 +912,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
> compat_ulong_t d_ino;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> @@ -998,6 +1004,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
> int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
> u64 off;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 2a533a0..9b96f5a 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
> int elen; /* estimated entry length in words */
> int num_entry_words = 0; /* actual number of words */
>
> + if (d_type == DT_WHT) {
> + cd->common.err = nfs_ok;
> + return 0;
> + }
> +
> if (cd->offset) {
> u64 offset64 = offset;
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 78c7e24..8839ba8 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2268,6 +2268,11 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
> return 0;
> }
>
> + if (d_type == DT_WHT) {
> + cd->common.err = nfs_ok;
> + return 0;
> + }
> +
> if (cd->offset)
> xdr_encode_hyper(cd->offset, (u64) offset);
>
> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> index 4ce005d..0e57d4b 100644
> --- a/fs/nfsd/nfsxdr.c
> +++ b/fs/nfsd/nfsxdr.c
> @@ -503,6 +503,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
> namlen, name, offset, ino);
> */
>
> + if (d_type == DT_WHT) {
> + cd->common.err = nfs_ok;
> + return 0;
> + }
> if (offset > ~((u32) 0)) {
> cd->common.err = nfserr_fbig;
> return -EINVAL;
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 7723401..3a48491 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> struct old_linux_dirent __user * dirent;
> unsigned long d_ino;
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> if (buf->result)
> return -EINVAL;
> d_ino = ino;
> @@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
> unsigned long d_ino;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> @@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
> struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
>
> + if (d_type == DT_WHT)
> + return 0;
> +
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> --
> 1.6.3.3
>

2010-04-19 12:37:44

Subject: Re: [PATCH 04/35] whiteout/NFSD: Don't return information about whiteouts to userspace

On Fri, Apr 16, J. Bruce Fields wrote:

> Seems OK. (Though is there any way we could avoid having to add the
> check to every filldir callback? Isn't the default going to be
> disinterest in whiteouts? How are we avoiding all the same checks in
> the case of lookup?)
>

Bruce,

The alternative would be to put the check into each file system's readdir()
implementation, so that the filler is never called for a whiteout. I think
that patch would be even bigger.

Jan
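
The two placements being discussed can be compared with a small userspace sketch. Everything here is a hypothetical model: toy_filler_t only loosely follows filldir_t (nonzero return means "stop"), and the TOY_DT_* constants simply reuse the numeric values of DT_REG and DT_WHT from <dirent.h>.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_DT_REG = 8, TOY_DT_WHT = 14 };

struct toy_ent {
	const char *name;
	unsigned int d_type;
};

/* Collects the names that would be handed to userspace. */
struct toy_buf {
	const char *names[8];
	size_t used;
};

typedef int (*toy_filler_t)(void *buf, const char *name, unsigned int d_type);

/* A filler with no whiteout check, e.g. a callback nobody patched. */
static int toy_filldir_raw(void *opaque, const char *name, unsigned int d_type)
{
	struct toy_buf *buf = opaque;

	(void)d_type;
	if (buf->used == sizeof(buf->names) / sizeof(buf->names[0]))
		return -1; /* buffer full: ask the iterator to stop */
	buf->names[buf->used++] = name;
	return 0;
}

/*
 * Placement 1 (the approach in this patch): the filler drops
 * whiteouts. Returning 0 lets the iterator continue, so the entry
 * is skipped silently.
 */
static int toy_filldir(void *opaque, const char *name, unsigned int d_type)
{
	if (d_type == TOY_DT_WHT)
		return 0;
	return toy_filldir_raw(opaque, name, d_type);
}

/*
 * Placement 2 (the alternative): with skip_whiteouts set, the
 * iterator never calls the filler for a whiteout.
 */
static void toy_readdir(const struct toy_ent *ents, size_t n,
			toy_filler_t filler, void *buf, int skip_whiteouts)
{
	assert(filler != NULL);
	for (size_t i = 0; i < n; i++) {
		if (skip_whiteouts && ents[i].d_type == TOY_DT_WHT)
			continue;
		if (filler(buf, ents[i].name, ents[i].d_type))
			break;
	}
}
```

In this model, placement 1 needs one check per filler, while placement 2 needs one check per file system iterator but also protects fillers that were never patched.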


2010-04-19 12:40:29

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Thu, Apr 15, Valerie Aurora wrote:

> Add support for fallthru directory entries to ext2.
>
> XXX - Makes up inode number for fallthru entry
> XXX - Might be better implemented as special symlinks

Better not. David Woodhouse actually convinced me to move away from the
special symlink approach. Whiteouts were implemented as special symlinks
before.

What makes you think that it would be beneficial to do so?

Thanks,
Jan

> Cc: Theodore Tso <[email protected]>
> Cc: [email protected]
> Signed-off-by: Valerie Aurora <[email protected]>
> Signed-off-by: Jan Blunck <[email protected]>
> ---
> fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext2/ext2.h | 1 +
> fs/ext2/namei.c | 22 +++++++++++
> include/linux/ext2_fs.h | 1 +
> 4 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index 030bd46..f3b4aff 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
> {
> if (len != de->name_len)
> return 0;
> - if (!de->inode && (de->file_type != EXT2_FT_WHT))
> + if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
> + (de->file_type != EXT2_FT_FALLTHRU)))
> return 0;
> return !memcmp(name, de->name, len);
> }
> @@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
> [EXT2_FT_SOCK] = DT_SOCK,
> [EXT2_FT_SYMLINK] = DT_LNK,
> [EXT2_FT_WHT] = DT_WHT,
> + [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
> };
>
> #define S_SHIFT 12
> @@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
> ext2_put_page(page);
> return 0;
> }
> + } else if (de->file_type == EXT2_FT_FALLTHRU) {
> + int over;
> + unsigned char d_type = DT_UNKNOWN;
> +
> + offset = (char *)de - kaddr;
> + /* XXX We don't know the inode number
> + * of the directory entry in the
> + * underlying file system. Should
> + * look it up, either on fallthru
> + * creation at first readdir or now at
> + * filldir time. */
> + over = filldir(dirent, de->name, de->name_len,
> + (n<<PAGE_CACHE_SHIFT) | offset,
> + 123 /* Made up ino */, d_type);
> + if (over) {
> + ext2_put_page(page);
> + return 0;
> + }
> }
> filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
> }
> @@ -463,6 +483,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
> spin_lock(&dentry->d_lock);
> dentry->d_flags |= DCACHE_WHITEOUT;
> spin_unlock(&dentry->d_lock);
> + } else if(!res && de->file_type == EXT2_FT_FALLTHRU) {
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> }
> ext2_put_page(page);
> }
> @@ -532,6 +556,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
> de->name_len = 0;
> de->rec_len = ext2_rec_len_to_disk(chunk_size);
> de->inode = 0;
> + de->file_type = 0;
> goto got_it;
> }
> if (de->rec_len == 0) {
> @@ -545,6 +570,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
> name_len = EXT2_DIR_REC_LEN(de->name_len);
> rec_len = ext2_rec_len_from_disk(de->rec_len);
> if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
> + (de->file_type != EXT2_FT_FALLTHRU) &&
> (rec_len >= reclen))
> goto got_it;
> if (rec_len >= name_len + reclen)
> @@ -587,7 +613,8 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
>
> err = -EEXIST;
> if (ext2_match (namelen, name, de)) {
> - if (de->file_type == EXT2_FT_WHT)
> + if ((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU))
> goto got_it;
> goto out_unlock;
> }
> @@ -602,7 +629,8 @@ got_it:
> &page, NULL);
> if (err)
> goto out_unlock;
> - if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
> + if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) &&
> !ext2_match (namelen, name, de))) {
> ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> @@ -627,6 +655,60 @@ out_unlock:
> }
>
> /*
> + * Create a fallthru entry.
> + */
> +int ext2_fallthru_entry (struct inode *dir, struct dentry *dentry)
> +{
> + const char *name = dentry->d_name.name;
> + int namelen = dentry->d_name.len;
> + unsigned short rec_len, name_len;
> + ext2_dirent * de;
> + struct page *page;
> + loff_t pos;
> + int err;
> +
> + de = ext2_append_entry(dentry, &page);
> + if (IS_ERR(de))
> + return PTR_ERR(de);
> +
> + err = -EEXIST;
> + if (ext2_match (namelen, name, de))
> + goto out_unlock;
> +
> + name_len = EXT2_DIR_REC_LEN(de->name_len);
> + rec_len = ext2_rec_len_from_disk(de->rec_len);
> +
> + pos = page_offset(page) +
> + (char*)de - (char*)page_address(page);
> + err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
> + &page, NULL);
> + if (err)
> + goto out_unlock;
> + if (de->inode || (de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) {
> + ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> + de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> + de->rec_len = ext2_rec_len_to_disk(name_len);
> + de = de1;
> + }
> + de->name_len = namelen;
> + memcpy(de->name, name, namelen);
> + de->inode = 0;
> + de->file_type = EXT2_FT_FALLTHRU;
> + err = ext2_commit_chunk(page, pos, rec_len);
> + dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
> + EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
> + mark_inode_dirty(dir);
> + /* OFFSET_CACHE */
> +out_put:
> + ext2_put_page(page);
> + return err;
> +out_unlock:
> + unlock_page(page);
> + goto out_put;
> +}
> +
> +/*
> * ext2_delete_entry deletes a directory entry by merging it with the
> * previous entry. Page is up-to-date. Releases the page.
> */
> @@ -711,7 +793,9 @@ int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
> */
> if (ext2_match (namelen, name, de))
> de->inode = 0;
> - if (de->inode || (de->file_type == EXT2_FT_WHT)) {
> + if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
> + (de->file_type == EXT2_FT_FALLTHRU)) &&
> + !ext2_match (namelen, name, de))) {
> ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
> de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
> de->rec_len = ext2_rec_len_to_disk(name_len);
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 44d190c..2fa32b3 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -108,6 +108,7 @@ extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *,
> extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
> extern int ext2_whiteout_entry (struct inode *, struct dentry *,
> struct ext2_dir_entry_2 *, struct page *);
> +extern int ext2_fallthru_entry (struct inode *, struct dentry *);
> extern int ext2_empty_dir (struct inode *);
> extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
> extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 12195a5..f28154c 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -349,6 +349,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
> goto out;
>
> spin_lock(&new_dentry->d_lock);
> + new_dentry->d_flags &= ~DCACHE_FALLTHRU;
> new_dentry->d_flags |= DCACHE_WHITEOUT;
> spin_unlock(&new_dentry->d_lock);
> d_add(new_dentry, NULL);
> @@ -367,6 +368,26 @@ out:
> return err;
> }
>
> +/*
> + * Create a fallthru entry.
> + */
> +static int ext2_fallthru (struct inode *dir, struct dentry *dentry)
> +{
> + int err;
> +
> + dquot_initialize(dir);
> +
> + err = ext2_fallthru_entry(dir, dentry);
> + if (err)
> + return err;
> +
> + d_instantiate(dentry, NULL);
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags |= DCACHE_FALLTHRU;
> + spin_unlock(&dentry->d_lock);
> + return 0;
> +}
> +
> static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
> struct inode * new_dir, struct dentry * new_dentry )
> {
> @@ -470,6 +491,7 @@ const struct inode_operations ext2_dir_inode_operations = {
> .rmdir = ext2_rmdir,
> .mknod = ext2_mknod,
> .whiteout = ext2_whiteout,
> + .fallthru = ext2_fallthru,
> .rename = ext2_rename,
> #ifdef CONFIG_EXT2_FS_XATTR
> .setxattr = generic_setxattr,
> diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
> index 20468bd..cb3d400 100644
> --- a/include/linux/ext2_fs.h
> +++ b/include/linux/ext2_fs.h
> @@ -577,6 +577,7 @@ enum {
> EXT2_FT_SOCK = 6,
> EXT2_FT_SYMLINK = 7,
> EXT2_FT_WHT = 8,
> + EXT2_FT_FALLTHRU = 9,
> EXT2_FT_MAX
> };
>
> --
> 1.6.3.3
>
Regards,
Jan

--
Jan Blunck <[email protected]>
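
A note on the ext2_match() hunk quoted above: before this change an entry with inode 0 was treated as an unused slot (whiteouts excepted), and the patch adds fallthru entries as a second exception that still carries a valid name. The following is only a userspace sketch of that predicate, under the assumption of a simplified entry layout. The names model_ext2_dirent and model_match() are hypothetical; only the two file type values come from the include/linux/ext2_fs.h hunk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* File type values as added in the include/linux/ext2_fs.h hunk. */
enum { MODEL_FT_WHT = 8, MODEL_FT_FALLTHRU = 9 };

/* Hypothetical, simplified stand-in for an on-disk ext2 directory entry. */
struct model_ext2_dirent {
	uint32_t inode;
	uint8_t name_len;
	uint8_t file_type;
	char name[255];
};

/* Mirrors the patched ext2_match(): an entry with inode 0 only matches
 * when it is a whiteout or a fallthru; otherwise it is an unused slot. */
static int model_match(int len, const char *name,
		       const struct model_ext2_dirent *de)
{
	assert(len >= 0);
	if (len != de->name_len)
		return 0;
	if (!de->inode && de->file_type != MODEL_FT_WHT &&
	    de->file_type != MODEL_FT_FALLTHRU)
		return 0;
	return !memcmp(name, de->name, (size_t)len);
}
```

In this model, a lookup of "foo" matches an inode-0 entry only if the entry is typed as a whiteout or a fallthru.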

2010-04-19 13:02:58

by David Woodhouse

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> On Thu, Apr 15, Valerie Aurora wrote:
>
> > Add support for fallthru directory entries to ext2.
> >
> > XXX - Makes up inode number for fallthru entry
> > XXX - Might be better implemented as special symlinks
>
> Better not. David Woodhouse actually convinced me of moving away from the
> special symlink approach. The whiteouts have been implemented as special
> symlinks before.

I certainly asked whether you really need a real 'struct inode' for
whiteouts, and suggested that they should be represented _purely_ as a
dentry with type DT_WHT.

I don't much like the manifestation of that in this patch though,
especially with the made-up inode number. (ISTR I had other
jffs2-specific objections too, which I'll dig out and forward).

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2010-04-19 13:03:35

by David Woodhouse

Subject: Re: [PATCH 11/35] whiteout: jffs2 whiteout support

On Thu, 2010-04-15 at 16:04 -0700, Valerie Aurora wrote:
> From: Felix Fietkau <[email protected]>
>
> Add support for whiteout dentries to jffs2.

This doesn't seem to have incorporated my feedback from the attached...

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

Attachments:

(No filename) (5.67 kB)
Attached message - Re: [PATCH 16/41] whiteout: jffs2 whiteout support

2010-04-19 13:23:49

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Mon, Apr 19, David Woodhouse wrote:

> On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> > On Thu, Apr 15, Valerie Aurora wrote:
> >
> > > Add support for fallthru directory entries to ext2.
> > >
> > > XXX - Makes up inode number for fallthru entry
> > > XXX - Might be better implemented as special symlinks
> >
> > Better not. David Woodhouse actually convinced me of moving away from the
> > special symlink approach. The whiteouts have been implemented as special
> > symlinks before.
>
> I certainly asked whether you really need a real 'struct inode' for
> whiteouts, and suggested that they should be represented _purely_ as a
> dentry with type DT_WHT.
>
> I don't much like the manifestation of that in this patch though,
> especially with the made-up inode number. (ISTR I had other
> jffs2-specific objections too, which I'll dig out and forward).

Yes, these patches still have issues that Val and I are aware of. I can't
remember anything jffs2-specific though.

We return that inode number because we don't want to lookup the name on the
other filesystem during readdir. Therefore returning DT_UNKNOWN to let the
userspace decide if it needs to stat the file was the easiest workaround. I
know that POSIX requires d_ino and d_name but on the other hand it does not
require anything more on how long d_ino is valid.

If somebody has an idea how to make this cleaner please speak up.

Regards,
Jan

--
Jan Blunck <[email protected]>
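
For readers unfamiliar with the DT_UNKNOWN convention mentioned above: userspace is expected to fall back to a stat call whenever readdir() does not report a type. A minimal sketch of that fallback, assuming a hypothetical helper named entry_type() (it is not part of the patch set) and a glibc-style <dirent.h> that exposes the DT_* constants:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Hypothetical helper: return the DT_* type of a directory entry,
 * calling fstatat() only when readdir() reported DT_UNKNOWN.
 * Returns DT_UNKNOWN if the stat itself fails. */
static unsigned char entry_type(int dirfd, const struct dirent *de)
{
	struct stat st;

	if (de->d_type != DT_UNKNOWN)
		return de->d_type;
	if (fstatat(dirfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
		return DT_UNKNOWN;

	switch (st.st_mode & S_IFMT) {
	case S_IFREG:  return DT_REG;
	case S_IFDIR:  return DT_DIR;
	case S_IFLNK:  return DT_LNK;
	case S_IFCHR:  return DT_CHR;
	case S_IFBLK:  return DT_BLK;
	case S_IFIFO:  return DT_FIFO;
	case S_IFSOCK: return DT_SOCK;
	default:       return DT_UNKNOWN;
	}
}
```

The extra stat is only paid for entries that report DT_UNKNOWN, which in this series would be the fallthru entries.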

2010-04-19 13:30:43

by Jamie Lokier

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Jan Blunck wrote:
> On Mon, Apr 19, David Woodhouse wrote:
>
> > On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> > > On Thu, Apr 15, Valerie Aurora wrote:
> > >
> > > > Add support for fallthru directory entries to ext2.
> > > >
> > > > XXX - Makes up inode number for fallthru entry
> > > > XXX - Might be better implemented as special symlinks
> > >
> > > Better not. David Woodhouse actually convinced me of moving away from the
> > > special symlink approach. The whiteouts have been implemented as special
> > > symlinks before.
> >
> > I certainly asked whether you really need a real 'struct inode' for
> > whiteouts, and suggested that they should be represented _purely_ as a
> > dentry with type DT_WHT.
> >
> > I don't much like the manifestation of that in this patch though,
> > especially with the made-up inode number. (ISTR I had other
> > jffs2-specific objections too, which I'll dig out and forward).
>
> Yes, these patches still have issues that Val and I are aware of. I can't
> remember anything jffs2-specific though.
>
> We return that inode number because we don't want to lookup the name on the
> other filesystem during readdir. Therefore returning DT_UNKNOWN to let the
> userspace decide if it needs to stat the file was the easiest workaround. I
> know that POSIX requires d_ino and d_name but on the other hand it does not
> require anything more on how long d_ino is valid.

Although the lifetime of d_ino might vary, I know some programs (not
public) that will break if they see a d_ino which is wrongly matching
the st_ino of another file somewhere on the same st_dev. They will
assume the name is a hard link to the other file, without calling
stat(), which I think is a reasonable assumption and a useful optimisation.

So the made-up d_ino should at least be careful to not match an inode
number of another file which has a stable st_ino.

Why not zero for d_ino?

-- Jamie
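
To illustrate the failure mode described above, here is a rough model of a program that keeps a table of inode numbers it has already seen on one st_dev and uses d_ino alone to decide that a name is a hard link. Everything in it is hypothetical, including the name assumed_hard_link() and the choice to treat d_ino == 0 as "no usable inode number"; it is not taken from any of the programs mentioned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model: 'known' holds inode numbers already seen on the
 * same st_dev. A directory entry whose d_ino matches one of them is
 * assumed to be a hard link, without calling stat(). In this sketch
 * d_ino == 0 is assumed to mean "no usable inode number". */
static bool assumed_hard_link(const ino_t *known, size_t n, ino_t d_ino)
{
	size_t i;

	if (d_ino == 0)
		return false;
	for (i = 0; i < n; i++)
		if (known[i] == d_ino)
			return true;
	return false;
}
```

With a real file at inode 123 on the same filesystem, a fallthru entry reported with the made-up d_ino of 123 would be misclassified by this model as a hard link, while one reported with 0 would not.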

2010-04-19 13:54:56

by J. Bruce Fields

Subject: Re: [PATCH 04/35] whiteout/NFSD: Don't return information about whiteouts to userspace

On Mon, Apr 19, 2010 at 02:37:41PM +0200, Jan Blunck wrote:
> On Fri, Apr 16, J. Bruce Fields wrote:
>
> > Seems OK. (Though is there any way we could avoid having to add the
> > check to every filldir callback? Isn't the default going to be
> > disinterest in whiteouts? How are we avoiding all the same checks in
> > the case of lookup?)
> >
>
> Bruce,
>
> the alternative would be to include the check in the fs readdir()
> implementation, and therefore prevent the call of the filler. I think this
> patch would be even bigger.

OK, makes sense.

--b.

>
> Jan
>
> > --b.
> >
> > On Thu, Apr 15, 2010 at 04:04:11PM -0700, Valerie Aurora wrote:
> > > From: Jan Blunck <[email protected]>
> > >
> > > Userspace isn't ready for handling another file type, so silently drop
> > > whiteout directory entries before they leave the kernel.
> > >
> > > Signed-off-by: Jan Blunck <[email protected]>
> > > Signed-off-by: David Woodhouse <[email protected]>
> > > Signed-off-by: Valerie Aurora <[email protected]>
> > > Cc: [email protected]
> > > Cc: "J. Bruce Fields" <[email protected]>
> > > Cc: Neil Brown <[email protected]>
> > > ---
> > > fs/compat.c | 9 +++++++++
> > > fs/nfsd/nfs3xdr.c | 5 +++++
> > > fs/nfsd/nfs4xdr.c | 5 +++++
> > > fs/nfsd/nfsxdr.c | 4 ++++
> > > fs/readdir.c | 9 +++++++++
> > > 5 files changed, 32 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/fs/compat.c b/fs/compat.c
> > > index 00d90c2..624e1a5 100644
> > > --- a/fs/compat.c
> > > +++ b/fs/compat.c
> > > @@ -838,6 +838,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
> > > struct compat_old_linux_dirent __user *dirent;
> > > compat_ulong_t d_ino;
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > if (buf->result)
> > > return -EINVAL;
> > > d_ino = ino;
> > > @@ -909,6 +912,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
> > > compat_ulong_t d_ino;
> > > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > buf->error = -EINVAL; /* only used if we fail.. */
> > > if (reclen > buf->count)
> > > return -EINVAL;
> > > @@ -998,6 +1004,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
> > > int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
> > > u64 off;
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > buf->error = -EINVAL; /* only used if we fail.. */
> > > if (reclen > buf->count)
> > > return -EINVAL;
> > > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> > > index 2a533a0..9b96f5a 100644
> > > --- a/fs/nfsd/nfs3xdr.c
> > > +++ b/fs/nfsd/nfs3xdr.c
> > > @@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
> > > int elen; /* estimated entry length in words */
> > > int num_entry_words = 0; /* actual number of words */
> > >
> > > + if (d_type == DT_WHT) {
> > > + cd->common.err = nfs_ok;
> > > + return 0;
> > > + }
> > > +
> > > if (cd->offset) {
> > > u64 offset64 = offset;
> > >
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 78c7e24..8839ba8 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -2268,6 +2268,11 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
> > > return 0;
> > > }
> > >
> > > + if (d_type == DT_WHT) {
> > > + cd->common.err = nfs_ok;
> > > + return 0;
> > > + }
> > > +
> > > if (cd->offset)
> > > xdr_encode_hyper(cd->offset, (u64) offset);
> > >
> > > diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> > > index 4ce005d..0e57d4b 100644
> > > --- a/fs/nfsd/nfsxdr.c
> > > +++ b/fs/nfsd/nfsxdr.c
> > > @@ -503,6 +503,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
> > > namlen, name, offset, ino);
> > > */
> > >
> > > + if (d_type == DT_WHT) {
> > > + cd->common.err = nfs_ok;
> > > + return 0;
> > > + }
> > > if (offset > ~((u32) 0)) {
> > > cd->common.err = nfserr_fbig;
> > > return -EINVAL;
> > > diff --git a/fs/readdir.c b/fs/readdir.c
> > > index 7723401..3a48491 100644
> > > --- a/fs/readdir.c
> > > +++ b/fs/readdir.c
> > > @@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> > > struct old_linux_dirent __user * dirent;
> > > unsigned long d_ino;
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > if (buf->result)
> > > return -EINVAL;
> > > d_ino = ino;
> > > @@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
> > > unsigned long d_ino;
> > > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > buf->error = -EINVAL; /* only used if we fail.. */
> > > if (reclen > buf->count)
> > > return -EINVAL;
> > > @@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
> > > struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
> > > int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
> > >
> > > + if (d_type == DT_WHT)
> > > + return 0;
> > > +
> > > buf->error = -EINVAL; /* only used if we fail.. */
> > > if (reclen > buf->count)
> > > return -EINVAL;
> > > --
> > > 1.6.3.3
> > >

2010-04-19 14:12:51

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Mon, Apr 19, Jamie Lokier wrote:

> Jan Blunck wrote:
> > On Mon, Apr 19, David Woodhouse wrote:
> >
> > > On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> > > > On Thu, Apr 15, Valerie Aurora wrote:
> > > >
> > > > > Add support for fallthru directory entries to ext2.
> > > > >
> > > > > XXX - Makes up inode number for fallthru entry
> > > > > XXX - Might be better implemented as special symlinks
> > > >
> > > > Better not. David Woodhouse actually convinced me of moving away from the
> > > > special symlink approach. The whiteouts have been implemented as special
> > > > symlinks before.
> > >
> > > I certainly asked whether you really need a real 'struct inode' for
> > > whiteouts, and suggested that they should be represented _purely_ as a
> > > dentry with type DT_WHT.
> > >
> > > I don't much like the manifestation of that in this patch though,
> > > especially with the made-up inode number. (ISTR I had other
> > > jffs2-specific objections too, which I'll dig out and forward).
> >
> > Yes, these patches still have issues that Val and I are aware of. I can't
> > remember anything jffs2-specific though.
> >
> > We return that inode number because we don't want to lookup the name on the
> > other filesystem during readdir. Therefore returning DT_UNKNOWN to let the
> > userspace decide if it needs to stat the file was the easiest workaround. I
> > know that POSIX requires d_ino and d_name but on the other hand it does not
> > require anything more on how long d_ino is valid.
>
> Although the lifetime of d_ino might vary, I know some programs (not
> public) that will break if they see a d_ino which is wrongly matching
> the st_ino of another file somewhere on the same st_dev. They will
> assume the name is a hard link to the other file, without calling
> stat(), which I think is a reasonable assumption and a useful optimisation.
>
> So the made-up d_ino should at least be careful to not match an inode
> number of another file which has a stable st_ino.
>
> Why not zero for d_ino?
>

Hmm, why not. Or even the ino of the directory we are reading from ...

Regards,
Jan

--
Jan Blunck <[email protected]>

2010-04-19 14:23:35

by Valerie Aurora

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Mon, Apr 19, 2010 at 04:12:48PM +0200, Jan Blunck wrote:
> On Mon, Apr 19, Jamie Lokier wrote:
>
> > Jan Blunck wrote:
> > > On Mon, Apr 19, David Woodhouse wrote:
> > >
> > > > On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> > > > > On Thu, Apr 15, Valerie Aurora wrote:
> > > > >
> > > > > > Add support for fallthru directory entries to ext2.
> > > > > >
> > > > > > XXX - Makes up inode number for fallthru entry
> > > > > > XXX - Might be better implemented as special symlinks
> > > > >
> > > > > Better not. David Woodhouse actually convinced me of moving away from the
> > > > > special symlink approach. The whiteouts have been implemented as special
> > > > > symlinks before.
> > > >
> > > > I certainly asked whether you really need a real 'struct inode' for
> > > > whiteouts, and suggested that they should be represented _purely_ as a
> > > > dentry with type DT_WHT.
> > > >
> > > > I don't much like the manifestation of that in this patch though,
> > > > especially with the made-up inode number. (ISTR I had other
> > > > jffs2-specific objections too, which I'll dig out and forward).
> > >
> > > Yes, these patches still have issues that Val and I are aware of. I can't
> > > remember anything jffs2-specific though.
> > >
> > > We return that inode number because we don't want to lookup the name on the
> > > other filesystem during readdir. Therefore returning DT_UNKNOWN to let the
> > > userspace decide if it needs to stat the file was the easiest workaround. I
> > > know that POSIX requires d_ino and d_name but on the other hand it does not
> > > require anything more on how long d_ino is valid.
> >
> > Although the lifetime of d_ino might vary, I know some programs (not
> > public) that will break if they see a d_ino which is wrongly matching
> > the st_ino of another file somewhere on the same st_dev. They will
> > assume the name is a hard link to the other file, without calling
> > stat(), which I think is a reasonable assumption and a useful optimisation.
> >
> > So the made-up d_ino should at least be careful to not match an inode
> > number of another file which has a stable st_ino.
> >
> > Why not zero for d_ino?
> >
>
> Hmm, why not. Or even the ino of the directory we are reading from ...

I don't recall there being any technical reason not to look up the
real inode number. I just wrote it that way because I was lazy. So I
like returning the directory's d_ino better than a single magic
number, but I'd at least like to try returning the real inode number
too.

-VAL

2010-04-19 14:26:55

by Valerie Aurora

Subject: Re: [PATCH 11/35] whiteout: jffs2 whiteout support

On Mon, Apr 19, 2010 at 02:03:27PM +0100, David Woodhouse wrote:
> On Thu, 2010-04-15 at 16:04 -0700, Valerie Aurora wrote:
> > From: Felix Fietkau <[email protected]>
> >
> > Add support for whiteout dentries to jffs2.
>
> This doesn't seem to have incorporated my feedback from the attached...

Hm, I'm not sure whether I lost the patch in a rebase or didn't have
time to test it or what. I was hoping someone who actually knows
JFFS2 like Felix or you would get to it first - in general, I'd like
the underlying file system maintainers to implement whiteouts and
fallthrus since they know them best. Felix, if you implemented it and
I lost the patch, my apologies to you.

Thanks David,

-VAL

> --
> David Woodhouse Open Source Technology Centre
> [email protected] Intel Corporation

Content-Description: Attached message - Re: [PATCH 16/41] whiteout: jffs2 whiteout support
> From: David Woodhouse <[email protected]>
> To: Valerie Aurora <[email protected]>
> Cc: Jan Blunck <[email protected]>, Alexander Viro <[email protected]>, Christoph Hellwig <[email protected]>, Andy Whitcroft <[email protected]>, Scott James Remnant <[email protected]>, Sandu Popa Marius <[email protected]>, Jan Rekorajski <[email protected]>, "J. R. Okajima" <[email protected]>, Arnd Bergmann <[email protected]>, Vladimir Dronnikov <[email protected]>, Felix Fietkau <[email protected]>, [email protected], [email protected], [email protected]
> Date: Thu, 22 Oct 2009 07:50:58 +0900
> Subject: Re: [PATCH 16/41] whiteout: jffs2 whiteout support
>
> On Wed, 2009-10-21 at 12:19 -0700, Valerie Aurora wrote:
> > From: Felix Fietkau <[email protected]>
> >
> > Add support for whiteout dentries to jffs2.
>
> As discussed, there are a few places where JFFS2 will assume that a
> dirent with fd->ino == 0 is a deletion dirent -- a kind of whiteout of
> its own, used internally because it's a log-structured file system and
> it needs to mark previously existing dirents as having been unlinked.
>
> You're breaking that assumption. So, for example, your whiteouts are
> going to get lost when the eraseblock containing them is garbage
> collected -- because they'll be treated like deletion dirents, which
> only need to remain on the medium for as long as the _real_ dirents
> which they exist to kill.
>
> This completely untested patch addresses some of it.
>
> The other thing to verify is the three places in dir.c which check
> whether whiteout/rmdir/rename should return -ENOTEMPTY. Those all do so
> by checking whether the directory in question has any dirents with
> fd->ino != 0 -- i.e. does it contain any _real_ dirents, or only the
> deletion markers for dead stuff.
>
> So that will now be _allowing_ you to remove a directory which contains
> whiteouts, since you haven't changed the test. Is that intentional? It
> seems sane at first glance.
>
> diff --git a/fs/jffs2/build.c b/fs/jffs2/build.c
> index c5e1450..4dc883f 100644
> --- a/fs/jffs2/build.c
> +++ b/fs/jffs2/build.c
> @@ -217,8 +217,9 @@ static void jffs2_build_remove_unlinked_inode(struct jffs2_sb_info *c,
> ic->scan_dents = fd->next;
>
> if (!fd->ino) {
> - /* It's a deletion dirent. Ignore it */
> - dbg_fsbuild("child \"%s\" is a deletion dirent, skipping...\n", fd->name);
> + dbg_fsbuild("child \"%s\" is a %s, skipping...\n",
> + fd->name,
> + (fd->type == DT_WHT)?"whiteout":"deletion dirent");
> jffs2_free_full_dirent(fd);
> continue;
> }
> diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
> index 090c556..7f5afbb 100644
> --- a/fs/jffs2/gc.c
> +++ b/fs/jffs2/gc.c
> @@ -516,7 +516,7 @@ static int jffs2_garbage_collect_live(struct jffs2_sb_info *c, struct jffs2_era
> break;
> }
>
> - if (fd && fd->ino) {
> + if (fd && (fd->ino || fd->type == DT_WHT)) {
> ret = jffs2_garbage_collect_dirent(c, jeb, f, fd);
> } else if (fd) {
> ret = jffs2_garbage_collect_deletion_dirent(c, jeb, f, fd);
> @@ -895,7 +895,7 @@ static int jffs2_garbage_collect_deletion_dirent(struct jffs2_sb_info *c, struct
> continue;
>
> /* If the name length doesn't match, or it's another deletion dirent, skip */
> - if (rd->nsize != name_len || !je32_to_cpu(rd->ino))
> + if (rd->nsize != name_len || (!je32_to_cpu(rd->ino) && rd->type != DT_WHT))
> continue;
>
> /* OK, check the actual name now */
> diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c
> index ca29440..bcd4b86 100644
> --- a/fs/jffs2/write.c
> +++ b/fs/jffs2/write.c
> @@ -629,8 +629,9 @@ int jffs2_do_unlink(struct jffs2_sb_info *c, struct jffs2_inode_info *dir_f,
> printk(KERN_WARNING "Deleting inode #%u with active dentry \"%s\"->ino #%u\n",
> dead_f->inocache->ino, fd->name, fd->ino);
> } else {
> - D1(printk(KERN_DEBUG "Removing deletion dirent for \"%s\" from dir ino #%u\n",
> - fd->name, dead_f->inocache->ino));
> + D1(printk(KERN_DEBUG "Removing %s for \"%s\" from dir ino #%u\n",
> + (fd->type == DT_WHT)?"whiteout":"deletion dirent",
> + fd->name, dead_f->inocache->ino));
> }
> if (fd->raw)
> jffs2_mark_node_obsolete(c, fd->raw);
>
>
> --
> David Woodhouse Open Source Technology Centre
> [email protected] Intel Corporation
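
The distinction David's patch introduces can be summarised as a three-way classification: a dirent with a non-zero ino is live, one with ino == 0 and type DT_WHT is a whiteout that must survive garbage collection, and any other ino == 0 dirent is an internal deletion marker. A small userspace model of that rule follows; model_fd, classify() and gc_must_copy() are hypothetical names, and the real JFFS2 structures carry far more state than this sketch assumes.

```c
#include <assert.h>
#include <stdint.h>

#ifndef DT_WHT
#define DT_WHT 14	/* value used by Linux for whiteout entries */
#endif

enum fd_kind { FD_LIVE, FD_WHITEOUT, FD_DELETION };

/* Hypothetical, heavily reduced stand-in for a JFFS2 full dirent. */
struct model_fd {
	uint32_t ino;
	unsigned char type;
};

static enum fd_kind classify(const struct model_fd *fd)
{
	if (fd->ino)
		return FD_LIVE;
	if (fd->type == DT_WHT)
		return FD_WHITEOUT;
	return FD_DELETION;
}

/* Mirrors the gc.c test "fd->ino || fd->type == DT_WHT": live dirents
 * and whiteouts are copied forward as ordinary dirents, while deletion
 * dirents take the separate deletion-dirent path. */
static int gc_must_copy(const struct model_fd *fd)
{
	return classify(fd) != FD_DELETION;
}
```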

2010-04-19 15:09:46

by Miklos Szeredi

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Mon, 19 Apr 2010, Valerie Aurora wrote:
> I don't recall there being any technical reason not to look up the
> real inode number. I just wrote it that way because I was lazy. So I
> like returning the directory's d_ino better than a single magic
> number, but I'd at least like to try returning the real inode number
> too.

Note, "struct dirent" doesn't have d_dev, so you really can't return
the "real" inode number, that's on a different filesystem and just a
random number in the context of the readdir in question.

Thanks,
Miklos
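
The point about d_dev can be made concrete: on POSIX systems a file is identified by the pair (st_dev, st_ino), and an inode number on its own says nothing across filesystems. A minimal sketch, with same_file() as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Two stat results name the same file only if both the device and the
 * inode number agree. struct dirent offers d_ino but no device field,
 * so this comparison cannot be made from readdir() output alone. */
static bool same_file(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}
```

An inode number copied from the lower filesystem into d_ino would therefore be compared against the wrong device by any caller that assumes the directory's own st_dev.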

2010-04-20 16:30:41

by Miklos Szeredi

Subject: Re: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

On Thu, 15 Apr 2010, Valerie Aurora wrote:
> +VFS implementation
> +==================
> +
> +Writable overlays are implemented as an integral part of the VFS,
> +rather than as a VFS client file system (i.e., a stacked file system
> +like unionfs or ecryptfs). Implementing writable overlays inside the
> +VFS eliminates the need for duplicate copies of VFS data structures,
> +unnecessary indirection, and code duplication, but requires very
> +maintainable, low-to-zero overhead code. Writable overlays require no
> +change to file systems serving as the read-only layer, and requires
> +some minor support from file systems serving as the read-write layer.
> +File systems that want to be the writable layer must implement the new
> +->whiteout() and ->fallthru() inode operations, which create special
> +dummy directory entries.

Maybe this should have been discussed earlier, but looking at all the
places where copyup and whiteout logic needs to be added (and the
current code is still unfinished, as you state) makes me wonder, does
all that really belong in the VFS?

What exactly are the areas where a VFS implementation eliminates
duplication and unnecessary indirection? Well, it turns out that in
the current implementation there's only one place, and that's
non-directory nodes.

Which begs the question: why do all the other things (union lookup,
directory merging and copyup, file copyup) need to be in the VFS?
Especially since I can imagine other union implementations wanting to
do these differently (e.g. not copying up directories in readdir).

What really needs to be in the VFS is the ability to:

- allow a filesystem to "redirect" a lookup to a different fs,

- if the operation happens to modify the file, then *not* redirect the
lookup

And there is already one example for the above: LAST_BIND lookups in
/proc. So basically it's mostly there and just needs to be
implemented in a filesystem.

Have I missed something fundamental? Are there other reasons why a
filesystem based implementation would be inferior?

Thanks,
Miklos

2010-04-20 21:35:12

by Jamie Lokier

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Miklos Szeredi wrote:
> On Mon, 19 Apr 2010, Valerie Aurora wrote:
> > I don't recall there being any technical reason not to look up the
> > real inode number. I just wrote it that we because I was lazy. So I
> > like returning the directory's d_ino better than a single magic
> > number, but I'd at least like to try returning the real inode number
> > too.
>
> Note, "struct dirent" doesn't have d_dev, so you really can't return
> the "real" inode number, that's on a different filesystem and just a
> random number in the context of the readdir in question.

Agree. Does this inappropriate inode number for the union mount's
st_dev happen with stat() on the actual files too? That could be bad.

-- Jamie

2010-04-20 21:40:52

by Jamie Lokier

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Valerie Aurora wrote:
> On Mon, Apr 19, 2010 at 04:12:48PM +0200, Jan Blunck wrote:
> > On Mon, Apr 19, Jamie Lokier wrote:
> >
> > > Jan Blunck wrote:
> > > > On Mon, Apr 19, David Woodhouse wrote:
> > > >
> > > > > On Mon, 2010-04-19 at 14:40 +0200, Jan Blunck wrote:
> > > > > > On Thu, Apr 15, Valerie Aurora wrote:
> > > > > >
> > > > > > > Add support for fallthru directory entries to ext2.
> > > > > > >
> > > > > > > XXX - Makes up inode number for fallthru entry
> > > > > > > XXX - Might be better implemented as special symlinks
> > > > > >
> > > > > > Better not. David Woodhouse actually convinced me of moving away from the
> > > > > > special symlink approach. The whiteouts have been implemented as special
> > > > > > symlinks before.
> > > > >
> > > > > I certainly asked whether you really need a real 'struct inode' for
> > > > > whiteouts, and suggested that they should be represented _purely_ as a
> > > > > dentry with type DT_WHT.
> > > > >
> > > > > I don't much like the manifestation of that in this patch though,
> > > > > especially with the made-up inode number. (ISTR I had other
> > > > > jffs2-specific objections too, which I'll dig out and forward).
> > > >
> > > > Yes, these patches still have issues that Val and I are aware of. I can't
> > > > remember anything jffs2-specific though.
> > > >
> > > > We return that inode number because we don't want to lookup the name on the
> > > > other filesystem during readdir. Therefore returning DT_UNKNOWN to let the
> > > > userspace decide if it needs to stat the file was the easiest workaround. I
> > > > know that POSIX requires d_ino and d_name but on the other hand it does not
> > > > require anything more on how long d_ino is valid.
> > >
> > > Although the lifetime of d_ino might vary, I know some programs (not
> > > public) that will break if they see a d_ino which is wrongly matching
> > > the st_ino of another file somewhere on the same st_dev. They will
> > > assume the name is a hard link to the other file, without calling
> > > stat(), which I think is a reasonable assumption and a useful optimisation.
> > >
> > > So the made-up d_ino should at least be careful to not match an inode
> > > number of another file which has a stable st_ino.
> > >
> > > Why not zero for d_ino?
> > >
> >
> > Hmm, why not. Or even the ino of the directory we are reading from ...
>
> I don't recall there being any technical reason not to look up the
> real inode number. I just wrote it that way because I was lazy. So I
> like returning the directory's d_ino better than a single magic
> number, but I'd at least like to try returning the real inode number
> too.

I thought of zero because Bash and GNU Readline both check d_ino != 0
to decide if an entry is valid.

On reflection, that is why zero must _not_ be used :-)

-- Jamie
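
A minimal sketch of the userspace pattern referred to above, in which an
entry with d_ino == 0 is treated as an empty slot and skipped. The
function name count_valid_entries() is hypothetical; this is not the
actual Bash or Readline code, only an illustration of why a zero d_ino
would make fallthru entries invisible to such consumers.

```c
#include <dirent.h>
#include <stddef.h>

/*
 * Count the entries that a traditional readdir() consumer would treat
 * as valid, i.e. those with a non-zero d_ino.
 * Returns -1 if the directory cannot be opened.
 */
int count_valid_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    int count = 0;

    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        if (de->d_ino == 0)     /* treated as a deleted/empty slot */
            continue;
        count++;
    }
    closedir(dir);
    return count;
}
```

Any entry reported with d_ino == 0 would be dropped by the loop above
even though the name is really present in the directory.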

2010-04-21 08:42:16

by Jan Blunck

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Tue, Apr 20, Jamie Lokier wrote:

> Miklos Szeredi wrote:
> > On Mon, 19 Apr 2010, Valerie Aurora wrote:
> > > I don't recall there being any technical reason not to look up the
> > > real inode number. I just wrote it that way because I was lazy. So I
> > > like returning the directory's d_ino better than a single magic
> > > number, but I'd at least like to try returning the real inode number
> > > too.
> >
> > Note, "struct dirent" doesn't have d_dev, so you really can't return
> > the "real" inode number, that's on a different filesystem and just a
> > random number in the context of the readdir in question.
>
> Agree. Does this inappropriate inode number for the union mount's
> st_dev happen with stat() on the actual files too? That could be bad.

No, for stat() you do a lookup and that is returning the correct dentry/inode
for the filesystem the name is on.

We just return the fallthru directory entries to give userspace an offset
that they can seekdir() to.

Regards,
Jan

--
Jan Blunck <[email protected]>

2010-04-21 09:22:54

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Jan Blunck wrote:
> On Tue, Apr 20, Jamie Lokier wrote:
>
> > Miklos Szeredi wrote:
> > > On Mon, 19 Apr 2010, Valerie Aurora wrote:
> > > > I don't recall there being any technical reason not to look up the
> > > > real inode number. I just wrote it that way because I was lazy. So I
> > > > like returning the directory's d_ino better than a single magic
> > > > number, but I'd at least like to try returning the real inode number
> > > > too.
> > >
> > > Note, "struct dirent" doesn't have d_dev, so you really can't return
> > > the "real" inode number, that's on a different filesystem and just a
> > > random number in the context of the readdir in question.
> >
> > Agree. Does this inappropriate inode number for the union mount's
> > st_dev happen with stat() on the actual files too? That could be bad.
>
> No, for stat() you do a lookup and that is returning the correct
> dentry/inode for the filesystem the name is on.

Hmm. I smell potential confusion for some otherwise POSIX-friendly
userspaces.

When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
keep the file open, then later do a readdir which includes foo
(dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
or unlink happened, close the file, abort streaming from it, refresh
the GUI windows, refresh application caches for that name entry, etc.

Because in the POSIX world I think open files have stable inode
numbers (as long as they are open), and I don't think that an open
file can have its name's d_ino not match the inode number unless it's
a mount point, which my program would know about.

This plays into inotify, where you have to know if you are monitoring
every directory that contains a link to a file, to know if you need to
monitor the file itself directly instead.

Now I think it's fair enough that a union mount doesn't play all the
traditional rules :-) C'est la vie.

This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
file-bind-mount. Like bind mounts, it's quite annoying for programs
that like to assume they've seen all of a file's links when they've
seen i_nlink of them.

Bind mounts can be detected by looking in /proc/mounts. st_dev
changing doesn't work because it can be a binding of the same
filesystem.

How would I go about detecting when a union mount's directory entry
has similar behaviour, without calling stat() on each entry? Is it
just a matter of recognising a particular filesystem name in
/proc/mounts, or something more?

Thanks,
-- Jamie

2010-04-21 09:35:22

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Wed, 21 Apr 2010, Jamie Lokier wrote:
> Hmm. I smell potential confusion for some otherwise POSIX-friendly
> userspaces.
>
> When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
> keep the file open, then later do a readdir which includes foo
> (dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
> or unlink happened, close the file, abort streaming from it, refresh
> the GUI windows, refresh application caches for that name entry, etc.
>
> Because in the POSIX world I think open files have stable inode
> numbers (as long as they are open), and I don't think that an open
> file can have its name's d_ino not match the inode number unless it's
> a mount point, which my program would know about.
>
> This plays into inotify, where you have to know if you are monitoring
> every directory that contains a link to a file, to know if you need to
> monitor the file itself directly instead.
>
> Now I think it's fair enough that a union mount doesn't play all the
> traditional rules :-) C'est la vie.
>
> This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
> file-bind-mount. Like bind mounts, it's quite annoying for programs
> that like to assume they've seen all of a file's links when they've
> seen i_nlink of them.
>
> Bind mounts can be detected by looking in /proc/mounts. st_dev
> changing doesn't work because it can be a binding of the same
> filesystem.
>
> How would I go about detecting when a union mount's directory entry
> has similar behaviour, without calling stat() on each entry? Is it
> just a matter of recognising a particular filesystem name in
> /proc/mounts, or something more?

Detecting mount points is best done by comparing st_dev for the parent
directory with st_dev of the child. This is much simpler than parsing
/proc/mounts and will work for bind mounts as well as union mounts.

I think there's no question that union mounts might break apps (POSIX
or not). But I think there's hope that they are few and can easily be
fixed.

Thanks,
Miklos
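
A minimal sketch of the st_dev comparison suggested above, assuming a
POSIX system. The helper name differs_from_parent_dev() is hypothetical.
It only detects a mount point when parent and child report different
st_dev values, so a bind mount of the same filesystem would go
unnoticed.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Returns 1 if directory `dir` reports a different st_dev than its
 * parent, 0 if both report the same st_dev, -1 on error.
 */
int differs_from_parent_dev(const char *dir)
{
    char parent[4096];
    struct stat st, pst;
    int n = snprintf(parent, sizeof(parent), "%s/..", dir);

    if (n < 0 || (size_t)n >= sizeof(parent))
        return -1;                      /* path too long */
    if (stat(dir, &st) != 0 || stat(parent, &pst) != 0)
        return -1;                      /* missing or not a directory */

    return st.st_dev != pst.st_dev;
}
```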

2010-04-21 09:52:33

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Miklos Szeredi wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> > Hmm. I smell potential confusion for some otherwise POSIX-friendly
> > userspaces.
> >
> > When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
> > keep the file open, then later do a readdir which includes foo
> > (dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
> > or unlink happened, close the file, abort streaming from it, refresh
> > the GUI windows, refresh application caches for that name entry, etc.
> >
> > Because in the POSIX world I think open files have stable inode
> > numbers (as long as they are open), and I don't think that an open
> > file can have its name's d_ino not match the inode number unless it's
> > a mount point, which my program would know about.
> >
> > This plays into inotify, where you have to know if you are monitoring
> > every directory that contains a link to a file, to know if you need to
> > monitor the file itself directly instead.
> >
> > Now I think it's fair enough that a union mount doesn't play all the
> > traditional rules :-) C'est la vie.
> >
> > This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
> > file-bind-mount. Like bind mounts, it's quite annoying for programs
> > that like to assume they've seen all of a file's links when they've
> > seen i_nlink of them.
> >
> > Bind mounts can be detected by looking in /proc/mounts. st_dev
> > changing doesn't work because it can be a binding of the same
> > filesystem.
> >
> > How would I go about detecting when a union mount's directory entry
> > has similar behaviour, without calling stat() on each entry? Is it
> > just a matter of recognising a particular filesystem name in
> > /proc/mounts, or something more?
>
> Detecting mount points is best done by comparing st_dev for the parent
> directory with st_dev of the child. This is much simpler than parsing
> /proc/mounts and will work for bind mounts as well as union mounts.

Sorry, no: That does not work for bind mounts. Both layers can have
the same st_dev. Nor does O_NOFOLLOW stop traversal in the middle of
a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
d_type to say it's a bind mount. Bind mounts are really a big pain
for i_nlink+inotify name counting.

Besides, calling stat() on every entry in a large directory to check
st_ino can be orders of magnitude slower than readdir() on a large
directory - especially with a cold cache. It is quicker, but much
more complicated, to parse /proc/mounts and apply arcane rules to find
the exceptions.

Can a union mount overlap two parts of the same filesystem?

> I think there's no question that union mounts might break apps (POSIX
> or not). But I think there's hope that they are few and can easily be
> fixed.

I agree, and union mount is a very useful feature that's worth
breaking a few apps for :-)

I'm curious if there's a clear way to go about it in this case, or
if it'll involve a certain amount of pattern recognition in /proc/mounts.

Basically I'm wondering if it's been thought about already.

-- Jamie

2010-04-21 10:18:09

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Wed, 21 Apr 2010, Jamie Lokier wrote:
> Sorry, no: That does not work for bind mounts. Both layers can have
> the same st_dev.

Okay.

> Nor does O_NOFOLLOW stop traversal in the middle of
> a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> d_type to say it's a bind mount. Bind mounts are really a big pain
> for i_nlink+inotify name counting.

I'm confused. You are monitoring a specific file and would like to
know if something is happening to any of its links, right?

Why do you need to know about bind mounts for that?

Count the number of times you encounter that d_ino and if that matches
i_nlink then every directory is monitored. Simple as that, no?

Thanks,
Miklos

2010-04-21 17:36:25

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Miklos Szeredi wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> > Sorry, no: That does not work for bind mounts. Both layers can have
> > the same st_dev.
>
> Okay.
>
> > Nor does O_NOFOLLOW stop traversal in the middle of
> > a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> > d_type to say it's a bind mount. Bind mounts are really a big pain
> > for i_nlink+inotify name counting.
>
> I'm confused. You are monitoring a specific file and would like to
> know if something is happening to any of its links, right?

Not quite. I'm monitoring a million files (say), so I must use
directory watches for most of them. I need directory watches anyway,
when the semantic is "calling open on /path/to/file and reading would
return the same data", because renames and unlinks are also a way to
invalidate monitored file contents.

At a high level, what we're talking about is the ability to cache and
verify the validity of information derived from reading files in the
filesystem, in a manner which efficiently triggers invalidation only
on changes. Being able to answer, as quickly as possible, "if I read
this, that and other, will I get the same results as the last time I
did those operations, without having to actually do them to check".
There are many applications, provided the method is reliable.

> Why do you need to know about bind mounts for that?
>
> Count the number of times you encounter that d_ino and if that matches
> i_nlink then every directory is monitored. Simple as that, no?

When I see a file has i_nlink > 1, I must watch the file directly
using a file-watch (with inotify; polling with stat() with dnotify),
_unless_ I have seen all the links to that file.

When I've seen all the links to a file, I know that my directory
watches on the directories containing those links are sufficient to
detect changes to the file contents. That's because every
file change will get notified on at least one of those paths.

I learn that I've seen all the links by seeing d_ino during readdir as
you suggested, or by st_ino in the cases where I've not had reason to
readdir and I have needed to open the file or call stat.

Let's look at some bind mounts. One where st_ino doesn't work:

/dirA/file1 [hard link to inode 100, i_nlink = 2]
/dirA/bound [bind mount, has /dirA/file1 mounted on it]
/dirB/file2 [hard link to inode 100, i_nlink = 2]

If the program is asked to open /dirA/file1 and /dirA/bound at various
times, and never asked to readdir /dirA, it will have used fstat not
readdir, seen the same (st_dev,st_ino,i_nlink = 2), and _wrongly_
concluded that it is monitoring all directories containing paths to
the file.

To avoid that problem, it parses /proc/mounts and detects that
/dirA/bound does not contribute to the link count. This is faster
than calling readdir in all possible places that it can happen.

Another one, where readdir + d_ino doesn't work anyway:

/dirA/file1 [hard link to inode 100, i_nlink = 2]
/dirB/dirX [bind mount, has /dirA mounted on it]
/dirC/file2 [hard link to inode 100, i_nlink = 2]

This time the program is asked to open /dirA/file1 and
/dirB/dirX/file1 at various times. Suppose it aggressively calls
readdir on all of the places it goes near, and uses d_ino comparisons.

Bear in mind it can't hunt for /dirC because there may be millions of
directories; this is just an example.

Then it will see the same d_ino for /dirA/file1 and /dirB/dirX/file1,
and wrongly conclude that it is monitoring all directories containing
paths to the file.

So again, it must parse /proc/mounts to detect that everything found
under /dirB/dirX mirrors /dirA.

This is a bit more complicated by the fact that inotify/dnotify send
events to the watching dentry parent of the link used to access a
file, not necessarily the parent in the mounted path space.

Although this doesn't make the bind mount problem go away, this is
where union mounts complicate the picture more:

Ideally, the program may assume that d_ino and st_ino match as long as
the file is open (on any filesystem), or that the filesystem type is
in a whitelist of ones with stable inode numbers (most local
filesystems), and it's not a mountpoint. So when it's asked to open
at one path, and something else asks it to readdir at another path, it
could combine the information to learn when it's found all entries,
without having to use redundant readdirs and stats.

I'm thinking that I might have to detect union mounts specially in
/proc/mounts, now that they are a VFS feature, and disable a bunch of
assumptions about d_ino when seeing them. Hopefully it is possible to
unambiguously check for union mount points in /proc/mounts?

d_ino == directory's st_ino sounds neat. Maybe that will be enough,
as a special magical Linux rule. When reading a directory, it's cheap
to get the directory's st_ino with fstat(). It's possible to bind
mount a directory on its _own_ child, so that st_ino == directory's
st_ino, but d_ino isn't affected so maybe that's the trick to use.

-- Jamie
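
The link-counting rule and the first bind-mount example above can be
sketched as a small model. Everything here is hypothetical (struct
link_obs, seen_all_links(), and the via_bind_mount flag, which stands in
for whatever the program learns from /proc/mounts); it only illustrates
why counting names that resolve to the same (st_dev, st_ino) overcounts
when one of those names is reached through a bind mount.

```c
#include <stddef.h>

/* One observed name that resolved to some inode (hypothetical type). */
struct link_obs {
    unsigned long dev;      /* st_dev seen for this name */
    unsigned long ino;      /* st_ino or d_ino seen for this name */
    int via_bind_mount;     /* 1 if the name is reached through a bind
                               mount, e.g. as learned from /proc/mounts */
};

/*
 * Returns 1 if at least `nlink` observed names refer to (dev, ino),
 * meaning directory watches alone would be considered sufficient.
 * With skip_bind_mounts set, names reached through a bind mount are
 * not counted as separate hard links.
 */
int seen_all_links(const struct link_obs *obs, size_t n,
                   unsigned long dev, unsigned long ino,
                   unsigned long nlink, int skip_bind_mounts)
{
    unsigned long seen = 0;

    for (size_t i = 0; i < n; i++) {
        if (obs[i].dev != dev || obs[i].ino != ino)
            continue;
        if (skip_bind_mounts && obs[i].via_bind_mount)
            continue;
        seen++;
    }
    return seen >= nlink;
}
```

With /dirA/file1 and /dirA/bound as the two observations and i_nlink = 2,
the naive count (skip_bind_mounts = 0) reports that all links were seen,
while the bind-mount-aware count does not, because /dirB/file2 has never
been observed.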

2010-04-21 21:34:59

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Wed, Apr 21, 2010 at 10:52:21AM +0100, Jamie Lokier wrote:
> Miklos Szeredi wrote:
> > Detecting mount points is best done by comparing st_dev for the parent
> > directory with st_dev of the child. This is much simpler than parsing
> > /proc/mounts and will work for bind mounts as well as union mounts.
>
> Sorry, no: That does not work for bind mounts. Both layers can have
> the same st_dev. Nor does O_NOFOLLOW stop traversal in the middle of
> a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> d_type to say it's a bind mount. Bind mounts are really a big pain
> for i_nlink+inotify name counting.
>
> Besides, calling stat() on every entry in a large directory to check
> st_ino can be orders of magnitude slower than readdir() on a large
> directory - especially with a cold cache. It is quicker, but much
> more complicated, to parse /proc/mounts and apply arcane rules to find
> the exceptions.
>
> Can a union mount overlap two parts of the same filesystem?

No. Each layer must be a separate file system, the bottom must be
read-only, the top must be writable, and they must be unioned at their
mount points.

> > I think there's no question that union mounts might break apps (POSIX
> > or not). But I think there's hope that they are few and can easily be
> > fixed.
>
> I agree, and union mount is a very useful feature that's worth
> breaking a few apps for :-)
>
> I'm curious if there's a clear way to go about it in this case, or
> if it'll involve a certain amount of pattern recognition in /proc/mounts.

All it takes is looking for the "union" string in the mount options.
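
A minimal sketch of such a check, assuming glibc's <mntent.h> interface
and assuming the mount table really does list "union" as a
comma-separated option as described here. The helper names
opts_contain() and count_union_mounts() are hypothetical.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole token in the comma-separated
 * option string `opts`, 0 otherwise. */
int opts_contain(const char *opts, const char *name)
{
    size_t len = strlen(name);
    const char *p = opts;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t tok = end ? (size_t)(end - p) : strlen(p);

        if (tok == len && strncmp(p, name, len) == 0)
            return 1;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}

/* Counts entries in a mount table (e.g. "/proc/mounts") whose options
 * include "union". Returns -1 if the table cannot be opened. */
int count_union_mounts(const char *mounts_file)
{
    FILE *fp = setmntent(mounts_file, "r");
    struct mntent *m;
    int count = 0;

    if (fp == NULL)
        return -1;
    while ((m = getmntent(fp)) != NULL) {
        if (opts_contain(m->mnt_opts, "union"))
            count++;
    }
    endmntent(fp);
    return count;
}
```

Matching whole tokens rather than substrings avoids false positives on
option names that merely contain the letters "union".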

> Basically I'm wondering if it's been thought about already.

Not as much as it deserves. :) Do you have any thoughts about better
solutions?

Something to keep in mind is that most of the app issues are already
present with bind mounts. In many cases, if an app doesn't work with
union mounts, it's also not going to work with bind mounts. I think
you have a good point that we could use a more straightforward way to
say, "Hey, you can't use the normal st_dev/st_ino rules right now..."

-VAL

2010-04-21 21:39:08

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

On Wed, Apr 21, 2010 at 11:34:52AM +0200, Miklos Szeredi wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> > Hmm. I smell potential confusion for some otherwise POSIX-friendly
> > userspaces.
> >
> > When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
> > keep the file open, then later do a readdir which includes foo
> > (dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
> > or unlink happened, close the file, abort streaming from it, refresh
> > the GUI windows, refresh application caches for that name entry, etc.
> >
> > Because in the POSIX world I think open files have stable inode
> > numbers (as long as they are open), and I don't think that an open
> > file can have its name's d_ino not match the inode number unless it's
> > a mount point, which my program would know about.
> >
> > This plays into inotify, where you have to know if you are monitoring
> > every directory that contains a link to a file, to know if you need to
> > monitor the file itself directly instead.
> >
> > Now I think it's fair enough that a union mount doesn't play all the
> > traditional rules :-) C'est la vie.
> >
> > This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
> > file-bind-mount. Like bind mounts, it's quite annoying for programs
> > that like to assume they've seen all of a file's links when they've
> > seen i_nlink of them.
> >
> > Bind mounts can be detected by looking in /proc/mounts. st_dev
> > changing doesn't work because it can be a binding of the same
> > filesystem.
> >
> > How would I go about detecting when a union mount's directory entry
> > has similar behaviour, without calling stat() on each entry? Is it
> > just a matter of recognising a particular filesystem name in
> > /proc/mounts, or something more?
>
> Detecting mount points is best done by comparing st_dev for the parent
> directory with st_dev of the child. This is much simpler than parsing
> /proc/mounts and will work for bind mounts as well as union mounts.
>
> I think there's no question that union mounts might break apps (POSIX
> or not). But I think there's hope that they are few and can easily be
> fixed.

I couldn't have put it better myself.

To expand slightly, if the broken apps are not few and easily fixed,
then we'll go back and make the kernel more complicated. I'd like to
try the simplest version we think will work, first.

Thanks!

-VAL

2010-04-21 22:11:01

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Valerie Aurora wrote:
> > I think there's no question that union mounts might break apps (POSIX
> > or not). But I think there's hope that they are few and can easily be
> > fixed.
>
> I couldn't have put it better myself.
>
> To expand slightly, if the broken apps are not few and easily fixed,
> then we'll go back and make the kernel more complicated. I'd like to
> try the simplest version we think will work, first.

Don't worry, I'm not trying to deviate you from that good plan.

Just throwing questions out to find what's a good and simple answer to
these little open questions to minimise trouble.

-- Jamie

2010-04-22 10:37:57

by J. R. Okajima

[permalink] [raw]

Subject: Re: [PATCH 13/35] fallthru: ext2 fallthru support

Jamie Lokier:
> Hmm. I smell potential confusion for some otherwise POSIX-friendly
> userspaces.
:::
> This plays into inotify, where you have to know if you are monitoring
> every directory that contains a link to a file, to know if you need to
> monitor the file itself directly instead.

In addition to the inode number issue with fallthru/readdir, hardlinks
in a union mount may be a problem. If you open a hardlinked file for
writing or try to chmod it, the internal copyup will happen and the
hardlink will be destroyed. For instance, when fileA and fileB are
hardlinked on the lower layer and the contents of fileA are modified
(copyup happens), you will not see the latest contents via fileB.
And I am afraid the IN_CREATE event may be fired to the parent dir if
you monitor it.

(I have pointed out this issue before, but the posted document didn't
seem to cover it.)

J. R. Okajima

2010-04-28 20:19:27

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

On Tue, Apr 20, 2010 at 06:30:10PM +0200, Miklos Szeredi wrote:
> On Thu, 15 Apr 2010, Valerie Aurora wrote:
> > +VFS implementation
> > +==================
> > +
> > +Writable overlays are implemented as an integral part of the VFS,
> > +rather than as a VFS client file system (i.e., a stacked file system
> > +like unionfs or ecryptfs). Implementing writable overlays inside the
> > +VFS eliminates the need for duplicate copies of VFS data structures,
> > +unnecessary indirection, and code duplication, but requires very
> > +maintainable, low-to-zero overhead code. Writable overlays require no
> > +change to file systems serving as the read-only layer, and require
> > +some minor support from file systems serving as the read-write layer.
> > +File systems that want to be the writable layer must implement the new
> > +->whiteout() and ->fallthru() inode operations, which create special
> > +dummy directory entries.
>
> Maybe this should have been discussed earlier, but looking at all the
> places where copyup and whiteout logic needs to be added (and the
> current code is still unfinished, as you state) makes me wonder, does
> all that really belong in the VFS?
>
> What exactly are the areas where a VFS implementation eliminates
> duplication and unnecessary indirection? Well, it turns out that in
> the current implementation there's only one place, and that's
> non-directory nodes.
>
> Which begs the question: why do all the other things (union lookup,
> directory merging and copyup, file copyup) need to be in the VFS?
> Especially since I can imagine other union implementations wanting to
> do these differently (e.g. not copying up directories in readdir).
>
> What really needs to be in the VFS is the ability to:
>
> - allow a filesystem to "redirect" a lookup to a different fs,
>
> - if the operation happens to modify the file, then *not* redirect the
> lookup
>
> And there is already one example for the above: LAST_BIND lookups in
> /proc. So basically it's mostly there and just needs to be
> implemented in a filesystem.
>
> Have I missed something fundamental? Are there other reasons why a
> filesystem based implementation would be inferior?

I'm sorry I haven't responded sooner, I've been trying to write a
detailed useful message and that turns out to be hard. I'll just
include a few of the highlights; mainly I want to say that I'd
rather do it the way you describe but when I tried it ended up even
uglier than the VFS implementation.

I went down this road initially (do most of the unioning in a file
system) and spent a couple of months on it. But I always ended up
having to do some level of copy-around and redirection similar to that
in unionfs.

One of the major difficulties that arises even when doing unioning at
the VFS level is keeping around the parent's path in order to do the
copyup later on. Take a look at the code pattern in the "union-mount:
Implement union-aware syscall()" series of patches. That's the
prettiest and most efficient version I could come up with, after two
other implementations, and it's in the VFS, at the vfs_foo_syscall()
level. I don't even know how I would start if I had to wait until the
file system op is called.

If you have some insights on how to do this, I'd love to hear them. I
don't enjoy writing VFS code for the fun of it. :)

Thanks,

-VAL

2010-04-30 16:50:32

by J. R. Okajima

[permalink] [raw]

Subject: Re: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

Valerie Aurora:
> One of the major difficulties that arises even when doing unioning at
> the VFS level is keeping around the parent's path in order to do the
> copyup later on. Take a look at the code pattern in the "union-mount:
> Implement union-aware syscall()" series of patches. That's the
> prettiest and most efficient version I could come up with, after two
> other implementations, and it's in the VFS, at the vfs_foo_syscall()
> level. I don't even know how I would start if I had to wait until the
> file system op is called.

I agree that it is the prettiest, and copyup at open for write makes it
easier. But some applications, for example modprobe(8), issue
mmap(MAP_PRIVATE) after open(O_RDWR). In this case, every kernel module
will be copied up, which is a waste of time and space. I guess this is
one reason why other implementations took the approach of copyup at
write. At the same time, I guess this issue may be less important since
the other parts are pretty enough.

J. R. Okajima
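
The access pattern described above can be sketched in userspace as
follows, assuming a POSIX system with a writable /tmp. The function name
private_write_leaves_file_intact() is hypothetical. It opens a file
read-write, stores through a MAP_PRIVATE mapping, and checks that the
file itself is unchanged, which is why a copyup triggered at open() time
would be wasted work for such a caller.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store through a MAP_PRIVATE mapping of a file opened
 * read-write left the file contents untouched, 0 if the file changed,
 * -1 if the temporary file could not be set up.
 */
int private_write_leaves_file_intact(void)
{
    char path[] = "/tmp/private-map-XXXXXX";
    char byte = 0;
    int result = -1;
    int fd = mkstemp(path);             /* mkstemp() opens with O_RDWR */

    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until close() */

    if (write(fd, "abc", 3) == 3) {
        char *map = mmap(NULL, 3, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            map[0] = 'X';               /* copy-on-write, private page */
            if (pread(fd, &byte, 1, 0) == 1)
                result = (byte == 'a'); /* file still holds 'a' */
            munmap(map, 3);
        }
    }
    close(fd);
    return result;
}
```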

2010-04-30 17:22:40

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

On Thu, Apr 29, 2010 at 11:33:39AM +0200, Miklos Szeredi wrote:
> On Wed, 28 Apr 2010, Valerie Aurora wrote:
> > I'm sorry I haven't responded sooner, I've been trying to write a
> > detailed useful message and that turns out to be hard. I'll just
> > include a few of the highlights; mainly I want to say that I'd
> > rather do it the way you describe but when I tried it ended up even
> > uglier than the VFS implementation.
> >
> > I went down this road initially (do most of the unioning in a file
> > system) and spent a couple of months on it. But I always ended up
> > having to do some level of copy-around and redirection similar to that
> > in unionfs.
>
> I haven't looked at unionfs in a long time. Can you say something
> more specific about what these problems were?

Sure. The short version is that unionfs has to allocate another copy
of each file system structure - inode, etc. - and then keep an array
of the matching structures from each of the file system layers. Each
unionfs file system op copies data up and down between the unionfs
structures and the underlying structures, and then calls the lower
file system op as necessary. Often it has to duplicate code from the
VFS before calling the lower file system ops.

Where union mounts has the advantage is that we make zero copies of
file system data structures and therefore don't need copyup or
interposition on as many ops. But if you wait until the file system
op is called, you have to attach your union-related data to the
associated data structure, and the underlying file system is already
using the private data pointer. And you have to keep a copy of the
underlying file system ops. And each data structure can be part of
multiple unions. So you end up with an effective second copy of the
file system data structure and a mess of linked lists or pointers.

> > One of the major difficulties that arises even when doing unioning at
> > the VFS level is keeping around the parent's path in order to do the
> > copyup later on. Take a look at the code pattern in the "union-mount:
> > Implement union-aware syscall()" series of patches. That's the
> > prettiest and most efficient version I could come up with, after two
> > other implementations, and it's in the VFS, at the vfs_foo_syscall()
> > level. I don't even know how I would start if I had to wait until the
> > file system op is called.
>
> On a high level I don't see a problem, the parent of every dentry can
> be found through ->d_parent.

Unfortunately, dentries aren't unioned - paths (dentry/mnt pairs) are.
So you can get the parent dentry in the file system op, but the dentry
is potentially part of many different mounts. There's no mapping from
a lower-level read-only dentry to the covering read-write parent
dentry because the read-only dentry could potentially be mounted in 5
different places. Which union mount is this dentry part of? You have
to record the parent's path during lookup and carry it around until
you do the copyup - for every syscall that alters a file, not just
open() and write(), but chmod(), etc. So if you implement it in the
VFS, you don't have to carry that info across the file system op
boundary.

I think the chmod() case really shows the issues well. user_path_nd()
records the parent's path during lookup (in an inefficient, possibly
racy manner), then union_copyup() does the copy (too early, before a
lot of permission checks). The underlying file system doesn't get
involved until the ->setattr() call in notify_change(), and all that
gets is the dentry.

> One issue is having to duplicate some locking and other stuff around
> vfs_whatever() calls. But that could be fixed by exporting suitable
> helpers from the VFS.

That's somewhat of an issue right now. For union mounts to be most
efficient and wonderful, system calls should be separated into two
sequential parts called from the same context as the user_path()
lookup:

1) permission checks and all read-only checks that can fail.
[union copyup happens here]
2) the actual write or change to the file system

Otherwise we have to push the parent nameidata down through the stack
to where the actual change happens. So if I want to avoid copying up
the file unless chmod() succeeds, in the current code structure I'd
have to add a nameidata and a mnt to notify_change()'s arguments. But
this is an optimization, not a correctness problem.
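
The two-part structure described above can be sketched as follows.
This is a toy userspace model under stated assumptions, not the patch
series' code: struct toy_file, toy_chmod_checks(), toy_copyup() and
toy_chmod() are all hypothetical names, and the "permission check" is
reduced to an ownership flag. The point is only the ordering: every
check that can fail runs first, the copyup happens next, and the
modification comes last, so a failed chmod() never triggers a copyup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a file visible through a union mount (hypothetical). */
struct toy_file {
    bool on_lower_layer;   /* still lives only on the read-only layer */
    bool caller_is_owner;  /* stand-in for the real permission checks */
    unsigned int mode;
    int copyups;           /* how many times the file was copied up */
};

/* Part 1: read-only checks that can fail. Nothing is modified here. */
static int toy_chmod_checks(const struct toy_file *f, unsigned int mode)
{
    if (!f->caller_is_owner)
        return -EPERM;
    if (mode & ~07777u)
        return -EINVAL;
    return 0;
}

/* Copy the file to the writable layer if it is not already there. */
static void toy_copyup(struct toy_file *f)
{
    if (f->on_lower_layer) {
        f->on_lower_layer = false;
        f->copyups++;
    }
}

/* Checks first, then copyup, then part 2: the actual change. */
int toy_chmod(struct toy_file *f, unsigned int mode)
{
    int err;

    assert(f != NULL);
    err = toy_chmod_checks(f, mode);
    if (err)
        return err;        /* failed before any copyup took place */

    toy_copyup(f);
    f->mode = mode;
    return 0;
}
```

In this model a caller without ownership gets -EPERM and the file stays
on the lower layer; only a call that passes the checks pays for the
copyup.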

-VAL

2010-04-30 22:37:54

by Miklos Szeredi

Subject: Re: [PATCH 16/35] union-mount: Writable overlays/union mounts documentation

On Wed, 28 Apr 2010, Valerie Aurora wrote:
> I'm sorry I haven't responded sooner, I've been trying to write a
> detailed useful message and that turns out to be hard. I'll just
> include a few of the highlights; mainly I want to say that I'd
> rather do it the way you describe but when I tried it ended up even
> uglier than the VFS implementation.
>
> I went down this road initially (do most of the unioning in a file
> system) and spent a couple of months on it. But I always ended up
> having to do some level of copy-around and redirection similar to that
> in unionfs.

I haven't looked at unionfs in a long time. Can you say something
more specific about what these problems were?

> One of the major difficulties that arises even when doing unioning at
> the VFS level is keeping around the parent's path in order to do the
> copyup later on. Take a look at the code pattern in the "union-mount:
> Implement union-aware syscall()" series of patches. That's the
> prettiest and most efficient version I could come up with, after two
> other implementations, and it's in the VFS, at the vfs_foo_syscall()
> level. I don't even know how I would start if I had to wait until the
> file system op is called.

On a high level I don't see a problem, the parent of every dentry can
be found through ->d_parent.

One issue is having to duplicate some locking and other stuff around
vfs_whatever() calls. But that could be fixed by exporting suitable
helpers from the VFS.

Other than that I don't see any fundamental issues with union
filesystems (except that they seem to grow too many features to be
maintainable).

Thanks,
Miklos