2007-12-05 14:38:18

by Bharata B Rao

Subject: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

Hi,

In Union Mount, the merged view of the directories of the union is obtained
by enhancing readdir(2)/getdents(2) to read the entries of all the
directories and merge them, eliminating duplicates. We have tried a few
approaches for this, but none of them could solve all the problems perfectly.
One of the challenges has been to provide correct directory seek support for
union mounted directories. Some time back, when I posted an
RFC (http://lkml.org/lkml/2007/9/7/22) on this problem, one of the
suggestions I received was to store the dirents in a cache (which we
already do for duplicate elimination) and define the directory seek
behaviour on this cache constructed out of the unioned directories.
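To make the duplicate-elimination part concrete, here is a minimal userspace sketch (not the kernel code in the patches below; `union_merge` and the fixed-size arrays are purely illustrative) of how a merged listing of two layers is built, with the upper layer shadowing the lower:

```c
/* Userspace sketch of union directory merging: entries from the upper
 * layer are always visible; a lower-layer entry is visible only if no
 * higher layer already has an entry with the same name. */
#include <string.h>
#include <assert.h>

#define MAX_ENTRIES 64

/* Merge upper[] and lower[] into out[]; returns the merged count. */
static int union_merge(const char *upper[], int nu,
                       const char *lower[], int nl,
                       const char *out[])
{
	int n = 0;

	/* Entries from the topmost layer are always visible. */
	for (int i = 0; i < nu; i++)
		out[n++] = upper[i];

	/* Lower entries are shadowed by same-named higher entries. */
	for (int i = 0; i < nl; i++) {
		int dup = 0;

		for (int j = 0; j < n; j++)
			if (strcmp(out[j], lower[i]) == 0) {
				dup = 1;
				break;
			}
		if (!dup)
			out[n++] = lower[i];
	}
	return n;
}
```

The kernel patches do the same walk over the union stack, but with filldir callbacks and a per-file cache instead of flat arrays.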

I set out to try this and the result is this set of patches. I am myself
not impressed by the implementation complexity in these patches, but I am
posting them here to get further comments and suggestions. Moreover, I
haven't debugged them thoroughly enough to uncover all the problems. While
I don't expect anybody to try these patches, for completeness' sake I should
mention that they apply on top of Jan Blunck's patches on 2.6.24-rc2-mm1
(ftp://ftp.suse.com/pub/people/jblunck/patches/).

In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (0, 1, 2, ...). This lets us define
offsets uniformly across all the directories of the union, irrespective of
the type of filesystem involved, and it is also what allows us to define a
seek behaviour on the union mounted directory. The cache is stored as part
of the struct file of the topmost directory of the union and remains alive
as long as the directory is kept open.
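As an illustration of this offset scheme (the names `cache_ent`, `cache_add` and `cache_seek` are hypothetical, not the structures the patches use), linear cookies make seek a plain index lookup, independent of the underlying filesystems' native d_off values:

```c
/* Sketch of the linear-cookie offset scheme: as merged entries enter
 * the cache they receive increasing cookies (0, 1, 2, ...), so a seek
 * to offset k simply repositions to the k-th cache entry. */
#include <assert.h>

struct cache_ent {
	const char *name;
	long long native_off;	/* d_off from the underlying fs; unused for seek */
};

/* Append an entry, assigning the next linear cookie; returns it. */
static long long cache_add(struct cache_ent *cache, long long *next,
			   const char *name, long long native_off)
{
	cache[*next].name = name;
	cache[*next].native_off = native_off;
	return (*next)++;
}

/* Seek is just an index lookup into the cache; 0 means EOF/invalid. */
static const char *cache_seek(const struct cache_ent *cache,
			      long long nents, long long off)
{
	return off < nents ? cache[off].name : 0;
}
```

The native d_off values can be wildly different per filesystem (hash cookies, byte offsets), which is exactly why the union layer cannot expose them directly.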

While this approach works, it has the following downsides:

- The cache can grow arbitrarily large for big directories, consuming lots
of memory. Pruning individual cache entries is out of the question, as the
entire cache is needed by subsequent readdirs for duplicate elimination.

- When an existing union gets overlaid by a new directory, the cache may
get duplicated for the new union, wasting more space.

- Whenever _any_ directory that is part of the union gets
modified (addition/deletion of entries), the dirent cache of every union
that this directory is part of needs to be purged and rebuilt. This is
expensive not only due to the re-reads of dirents but also because
readdir(2)/getdents(2) needs to be synchronized with other operations
like mkdir/mknod/link/unlink etc.

- Since lseek(2) of the unioned directory also works on the same dirent
cache, it too needs to be invalidated when the directory gets modified.

- Supporting SEEK_END of lseek(2) is expensive as it involves reading in
all the dirents to know the EOF of the directory file.

After all this, I am beginning to wonder whether it would be better to
delegate this readdir and whiteout processing to userspace. Could it be
handled better by readdir(3) in glibc? But even then, I am not sure that
correct seek behaviour (from lseek(2) or seekdir(3)) can be achieved in an
easy manner. Any thoughts on this?

Earlier, Erez Zadok had suggested that things would become easier if the
readdir state were stored in a disk file (http://lkml.org/lkml/2007/9/7/114).
That approach simplifies directory seek support in Unionfs, but I am not sure
whether it would go well with a VFS-based unification approach like
Union Mount.

Regards,
Bharata.


2007-12-05 14:39:00

by Bharata B Rao

Subject: [RFC PATCH 1/5] Remove existing directory listing implementation

Remove the existing readdir implementation.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/readdir.c | 10 +
fs/union.c | 333 --------------------------------------------------
include/linux/union.h | 23 ---
3 files changed, 8 insertions(+), 358 deletions(-)

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/unistd.h>
-#include <linux/union.h>

#include <asm/uaccess.h>

int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
int res = -ENOTDIR;

if (!file->f_op || !file->f_op->readdir)
@@ -31,7 +31,13 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;

- res = do_readdir(file, buf, filler);
+ mutex_lock(&inode->i_mutex);
+ res = -ENOENT;
+ if (!IS_DEADDIR(inode)) {
+ res = file->f_op->readdir(file, buf, filler);
+ file_accessed(file);
+ }
+ mutex_unlock(&inode->i_mutex);
out:
return res;
}
--- a/fs/union.c
+++ b/fs/union.c
@@ -516,339 +516,6 @@ int last_union_is_root(struct path *path
}

/*
- * Union mounts support for readdir.
- */
-
-/* This is a copy from fs/readdir.c */
-struct getdents_callback {
- struct linux_dirent __user *current_dir;
- struct linux_dirent __user *previous;
- int count;
- int error;
-};
-
-/* The readdir union cache object */
-struct union_cache_entry {
- struct list_head list;
- struct qstr name;
-};
-
-static int union_cache_add_entry(struct list_head *list,
- const char *name, int namelen)
-{
- struct union_cache_entry *this;
- char *tmp_name;
-
- this = kmalloc(sizeof(*this), GFP_KERNEL);
- if (!this) {
- printk(KERN_CRIT
- "union_cache_add_entry(): out of kernel memory\n");
- return -ENOMEM;
- }
-
- tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
- if (!tmp_name) {
- printk(KERN_CRIT
- "union_cache_add_entry(): out of kernel memory\n");
- kfree(this);
- return -ENOMEM;
- }
-
- this->name.name = tmp_name;
- this->name.len = namelen;
- this->name.hash = 0;
- memcpy(tmp_name, name, namelen);
- tmp_name[namelen] = 0;
- INIT_LIST_HEAD(&this->list);
- list_add(&this->list, list);
- return 0;
-}
-
-static void union_cache_free(struct list_head *uc_list)
-{
- struct list_head *p;
- struct list_head *ptmp;
- int count = 0;
-
- list_for_each_safe(p, ptmp, uc_list) {
- struct union_cache_entry *this;
-
- this = list_entry(p, struct union_cache_entry, list);
- list_del_init(&this->list);
- kfree(this->name.name);
- kfree(this);
- count++;
- }
- return;
-}
-
-static int union_cache_find_entry(struct list_head *uc_list,
- const char *name, int namelen)
-{
- struct union_cache_entry *p;
- int ret = 0;
-
- list_for_each_entry(p, uc_list, list) {
- if (p->name.len != namelen)
- continue;
- if (strncmp(p->name.name, name, namelen) == 0) {
- ret = 1;
- break;
- }
- }
-
- return ret;
-}
-
-/*
- * There are four filldir() wrapper necessary for the union mount readdir
- * implementation:
- *
- * - filldir_topmost(): fills the union's readdir cache and the user space
- * buffer. This is only used for the topmost directory
- * in the union stack.
- * - filldir_topmost_cacheonly(): only fills the union's readdir cache.
- * This is only used for the topmost directory in the
- * union stack.
- * - filldir_overlaid(): fills the union's readdir cache and the user space
- * buffer. This is only used for directories on the
- * stack's lower layers.
- * - filldir_overlaid_cacheonly(): only fills the union's readdir cache.
- * This is only used for directories on the stack's
- * lower layers.
- */
-
-struct union_cache_callback {
- struct getdents_callback *buf; /* original getdents_callback */
- struct list_head list; /* list of union cache entries */
- filldir_t filler; /* the filldir() we should call */
- loff_t offset; /* base offset of our dirents */
- loff_t count; /* maximum number of bytes to "read" */
-};
-
-static int filldir_topmost(void *buf, const char *name, int namlen,
- loff_t offset, u64 ino, unsigned int d_type)
-{
- struct union_cache_callback *cb = buf;
-
- union_cache_add_entry(&cb->list, name, namlen);
- return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
- d_type);
-}
-
-static int filldir_topmost_cacheonly(void *buf, const char *name, int namlen,
- loff_t offset, u64 ino,
- unsigned int d_type)
-{
- struct union_cache_callback *cb = buf;
-
- if (offset > cb->count)
- return -EINVAL;
-
- union_cache_add_entry(&cb->list, name, namlen);
- return 0;
-}
-
-static int filldir_overlaid(void *buf, const char *name, int namlen,
- loff_t offset, u64 ino, unsigned int d_type)
-{
- struct union_cache_callback *cb = buf;
-
- switch (namlen) {
- case 2:
- if (name[1] != '.')
- break;
- case 1:
- if (name[0] != '.')
- break;
- return 0;
- }
-
- if (union_cache_find_entry(&cb->list, name, namlen))
- return 0;
-
- union_cache_add_entry(&cb->list, name, namlen);
- return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
- d_type);
-}
-
-static int filldir_overlaid_cacheonly(void *buf, const char *name, int namlen,
- loff_t offset, u64 ino,
- unsigned int d_type)
-{
- struct union_cache_callback *cb = buf;
-
- if (offset > cb->count)
- return -EINVAL;
-
- switch (namlen) {
- case 2:
- if (name[1] != '.')
- break;
- case 1:
- if (name[0] != '.')
- break;
- return 0;
- }
-
- if (union_cache_find_entry(&cb->list, name, namlen))
- return 0;
-
- union_cache_add_entry(&cb->list, name, namlen);
- return 0;
-}
-
-/*
- * readdir_union_cache - A helper to fill the readdir cache
- */
-static int readdir_union_cache(struct file *file, void *_buf, filldir_t filler)
-{
- struct union_cache_callback *cb = _buf;
- int old_count;
- loff_t old_pos;
- int res;
-
- old_count = cb->count;
- cb->count = ((file->f_pos > i_size_read(file->f_path.dentry->d_inode)) ?
- i_size_read(file->f_path.dentry->d_inode) :
- file->f_pos) & INT_MAX;
- old_pos = file->f_pos;
- file->f_pos = 0;
- res = file->f_op->readdir(file, _buf, filler);
- file->f_pos = old_pos;
- cb->count = old_count;
- return res;
-}
-
-/*
- * readdir_union - A wrapper around ->readdir()
- *
- * This is a wrapper around the filesystems readdir(), which is walking
- * the union stack and calls ->readdir() for every directory in the stack.
- * The directory entries are read into the union mounts readdir cache to
- * support whiteout's and duplicate removal.
- */
-int readdir_union(struct file *file, void *buf, filldir_t filler)
-{
- struct inode *inode = file->f_path.dentry->d_inode;
- struct union_cache_callback cb;
- struct path path;
- loff_t offset = 0;
- int res = 0;
-
- mutex_lock(&inode->i_mutex);
- if (IS_DEADDIR(inode)) {
- mutex_unlock(&inode->i_mutex);
- return -ENOENT;
- }
-
- INIT_LIST_HEAD(&cb.list);
- cb.buf = buf;
- cb.filler = filler;
- cb.offset = 0;
- offset = i_size_read(file->f_path.dentry->d_inode);
- cb.count = file->f_pos;
-
- if (file->f_pos > 0) {
- /*
- * We have already read from this dir, lets read that stuff to
- * our union-cache only
- */
- res = readdir_union_cache(file, &cb,
- filldir_topmost_cacheonly);
- if (res) {
- mutex_unlock(&inode->i_mutex);
- goto out;
- }
- }
-
- if (file->f_pos < offset) {
- res = file->f_op->readdir(file, &cb, filldir_topmost);
- file_accessed(file);
- if (res) {
- mutex_unlock(&inode->i_mutex);
- goto out;
- }
- /* We read until EOF of this directory */
- file->f_pos = offset;
- }
-
- mutex_unlock(&inode->i_mutex);
-
- path = file->f_path;
- path_get(&path);
- while (follow_union_down(&path.mnt, &path.dentry)) {
- struct file *ftmp;
-
- /* get path reference for filep */
- path_get(&path);
- ftmp = dentry_open(path.dentry, path.mnt,
- ((file->f_flags & ~(O_ACCMODE)) |
- O_RDONLY | O_DIRECTORY | O_NOATIME));
- if (IS_ERR(ftmp)) {
- res = PTR_ERR(ftmp);
- break;
- }
-
- inode = path.dentry->d_inode;
- mutex_lock(&inode->i_mutex);
-
- /* rearrange the file position */
- cb.offset += offset;
- offset = i_size_read(inode);
- ftmp->f_pos = file->f_pos - cb.offset;
- cb.count = ftmp->f_pos;
- if (ftmp->f_pos < 0) {
- mutex_unlock(&inode->i_mutex);
- fput(ftmp);
- break;
- }
-
- res = -ENOENT;
- if (IS_DEADDIR(inode))
- goto out_fput;
-
- if (ftmp->f_pos > 0) {
- /*
- * We have already read from this dir, lets read that
- * stuff to our union-cache only
- */
- res = readdir_union_cache(ftmp, &cb,
- filldir_overlaid_cacheonly);
- if (res)
- goto out_fput;
- }
-
- if (ftmp->f_pos < offset) {
- res = ftmp->f_op->readdir(ftmp, &cb, filldir_overlaid);
- file_accessed(ftmp);
- if (res)
- file->f_pos += ftmp->f_pos;
- else
- /*
- * We read until EOF of this directory, so lets
- * advance the f_pos by the maximum offset
- * (i_size) of this directory
- */
- file->f_pos += offset;
- }
-
- file_accessed(ftmp);
-
-out_fput:
- mutex_unlock(&inode->i_mutex);
- fput(ftmp);
-
- if (res)
- break;
- }
- path_put(&path);
-out:
- union_cache_free(&cb.list);
- return res;
-}
-
-/*
* Union mount copyup support
*/

--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -54,7 +54,6 @@ extern int attach_mnt_union(struct vfsmo
struct dentry *);
extern void detach_mnt_union(struct vfsmount *);
extern int last_union_is_root(struct path *);
-extern int readdir_union(struct file *, void *, filldir_t);
extern int is_dir_unioned(struct path *);
extern int union_relookup_topmost(struct nameidata *, int);
extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
@@ -83,27 +82,5 @@ extern int union_copyup(struct nameidata

#endif /* CONFIG_UNION_MOUNT */

-static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
-{
- int res = 0;
- struct inode *inode = file->f_path.dentry->d_inode;
-
-#ifdef CONFIG_UNION_MOUNT
- if (IS_MNT_UNION(file->f_path.mnt) && is_dir_unioned(&file->f_path))
- res = readdir_union(file, buf, filler);
- else
-#endif
- {
- mutex_lock(&inode->i_mutex);
- res = -ENOENT;
- if (!IS_DEADDIR(inode)) {
- res = file->f_op->readdir(file, buf, filler);
- file_accessed(file);
- }
- mutex_unlock(&inode->i_mutex);
- }
- return res;
-}
-
#endif /* __KERNEL__ */
#endif /* __LINUX_UNION_H */

2007-12-05 14:39:44

by Bharata B Rao

Subject: [RFC PATCH 2/5] Add New directory listing approach

Another readdir implementation for union mounted directories.

This reads dirents from all layers of the union into a cache and eliminates
duplicates before returning them to userspace. The cache is stored
persistently as part of the struct file of the topmost directory. Instead of
the original directory offsets, offsets are defined as linearly increasing
indices on this cache, and these are what is returned to userspace.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/file_table.c | 1
fs/readdir.c | 10 -
fs/union.c | 281 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 30 +++++
include/linux/union.h | 28 ++++
5 files changed, 342 insertions(+), 8 deletions(-)

--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -286,6 +286,7 @@ void fastcall __fput(struct file *file)
drop_file_write_access(file);

put_pid(file->f_owner.pid);
+ put_rdstate(file->f_rdstate);
file_kill(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/unistd.h>
+#include <linux/union.h>

#include <asm/uaccess.h>

int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
- struct inode *inode = file->f_path.dentry->d_inode;
int res = -ENOTDIR;

if (!file->f_op || !file->f_op->readdir)
@@ -31,13 +31,7 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;

- mutex_lock(&inode->i_mutex);
- res = -ENOENT;
- if (!IS_DEADDIR(inode)) {
- res = file->f_op->readdir(file, buf, filler);
- file_accessed(file);
- }
- mutex_unlock(&inode->i_mutex);
+ res = do_readdir(file, buf, filler);
out:
return res;
}
--- a/fs/union.c
+++ b/fs/union.c
@@ -46,8 +46,10 @@ static struct hlist_head *union_rhashtab
* - union_lock
*/
DEFINE_SPINLOCK(union_lock);
+DEFINE_MUTEX(union_rdmutex);

static struct kmem_cache *union_cache __read_mostly;
+static struct kmem_cache *readdir_cache;

static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
{
@@ -101,6 +103,9 @@ static int __init init_union(void)
for (loop = 0; loop < (1 << union_rhash_shift); loop++)
INIT_HLIST_HEAD(&union_rhashtable[loop]);

+ readdir_cache = kmem_cache_create("readdir-cache",
+ sizeof(struct rdcache_entry), 0,
+ SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
return 0;
}

@@ -516,6 +521,282 @@ int last_union_is_root(struct path *path
}

/*
+ * readdir support for Union mounts.
+ */
+
+struct rdcache_callback {
+ void *buf; /* original callback buffer */
+ filldir_t filldir; /* the filldir() we should call */
+ int error; /* stores filldir error */
+ struct rdstate *rdstate; /* readdir state */
+};
+
+/*
+ * This is called after every ->readdir() to persistently store the number of
+ * entries in a directory in the corresponding union_mount structure.
+ */
+static void update_um_dirents(struct rdstate *r)
+{
+ struct union_mount *um;
+
+ spin_lock(&union_lock);
+ um = union_lookup(r->cur_path.dentry, r->cur_path.mnt);
+ if (!um)
+ goto out;
+ um->nr_dirents = r->nr_dirents;
+out:
+ spin_unlock(&union_lock);
+}
+
+static void rdcache_free(struct list_head *list)
+{
+ struct list_head *p;
+ struct list_head *ptmp;
+ int count = 0;
+
+ list_for_each_safe(p, ptmp, list) {
+ struct rdcache_entry *this;
+
+ this = list_entry(p, struct rdcache_entry, list);
+ list_del_init(&this->list);
+ kfree(this->name.name);
+ kmem_cache_free(readdir_cache, this);
+ count++;
+ }
+ INIT_LIST_HEAD(list);
+ return;
+}
+
+static int rdcache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+ struct rdcache_entry *p;
+ int ret = 0;
+
+ list_for_each_entry(p, uc_list, list) {
+ if (p->name.len != namelen)
+ continue;
+ if (strncmp(p->name.name, name, namelen) == 0) {
+ ret = 1;
+ break;
+ }
+ }
+ return ret;
+}
+
+static int rdcache_add_entry(struct rdstate *r, struct list_head *list,
+ const char *name, int namelen, loff_t offset, u64 ino,
+ unsigned int d_type)
+{
+ struct rdcache_entry *this;
+ char *tmp_name;
+
+ this = kmem_cache_alloc(readdir_cache, GFP_KERNEL);
+ if (!this) {
+ printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
+ return -ENOMEM;
+ }
+
+ tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+ if (!tmp_name) {
+ printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
+ kmem_cache_free(readdir_cache, this);
+ return -ENOMEM;
+ }
+
+ this->name.name = tmp_name;
+ this->name.len = namelen;
+ this->name.hash = 0;
+ memcpy(tmp_name, name, namelen);
+ tmp_name[namelen] = 0;
+ this->off = offset;
+ this->ino = ino;
+ this->dtype = d_type;
+ INIT_LIST_HEAD(&this->list);
+ list_add_tail(&this->list, list);
+ return 0;
+}
+
+/*
+ * filldir routine for union mounted directories.
+ * Handles duplicate elimination by building a readdir cache.
+ */
+static int filldir_union(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct rdcache_callback *cb = buf;
+ struct rdstate *r = cb->rdstate;
+ int err = 0;
+
+ /*
+ * When a dirent gets skipped like this, the offset of the
+ * next dirent from the previous dirent will also not point to the
+ * skipped dirent.
+ */
+ if (rdcache_find_entry(&r->dirent_cache, name, namlen))
+ return 0;
+
+ err = cb->filldir(cb->buf, name, namlen, r->cur_off,
+ ino, d_type);
+ if (err >= 0) {
+ rdcache_add_entry(r, &r->dirent_cache,
+ name, namlen, offset, ino, d_type);
+ r->cur_off = ++r->last_off;
+ r->nr_dirents++;
+ }
+ cb->error = err;
+ return err;
+}
+
+/* Called from last fput() */
+void put_rdstate(struct rdstate *rdstate)
+{
+ if (!rdstate)
+ return;
+
+ mutex_lock(&union_rdmutex);
+ path_put(&rdstate->cur_path);
+ rdcache_free(&rdstate->dirent_cache);
+ mutex_unlock(&union_rdmutex);
+ kfree(rdstate);
+}
+
+static struct rdstate *get_rdstate(struct file *file)
+{
+ struct rdstate *r = file->f_rdstate;
+
+ if (r)
+ return r;
+
+ /*
+ * We have read the dirents from this earlier but now don't have a
+ * corresponding rdstate. This shouldn't happen.
+ */
+ if (file->f_pos)
+ return ERR_PTR(-EINVAL);
+
+ r = kzalloc(sizeof(struct rdstate), GFP_KERNEL);
+ if (!r)
+ return ERR_PTR(-ENOMEM);
+
+ r->cur_path = file->f_path;
+ path_get(&r->cur_path);
+ INIT_LIST_HEAD(&r->dirent_cache);
+ file->f_rdstate = r;
+ return r;
+}
+
+int readdir_union(struct file *file, void *buf, filldir_t filler)
+{
+ struct dentry *topmost = file->f_path.dentry;
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct rdstate *rdstate;
+ struct path path;
+ loff_t offset = 0;
+ struct rdcache_callback cb;
+ int err = 0;
+
+ if (IS_DEADDIR(inode))
+ return -ENOENT;
+
+ rdstate = get_rdstate(file);
+ if (IS_ERR(rdstate)) {
+ err = PTR_ERR(rdstate);
+ return err;
+ }
+
+ cb.buf = buf;
+ cb.filldir = filler;
+ cb.rdstate = rdstate;
+ cb.error = 0;
+
+ offset = rdstate->file_off;
+
+ /* Read from the topmost directory */
+ if (rdstate->cur_path.dentry == topmost) {
+ file->f_pos = offset;
+ err = file->f_op->readdir(file, &cb, filldir_union);
+ rdstate->file_off = file->f_pos;
+ update_um_dirents(rdstate);
+ if (err >= 0)
+ err = cb.error;
+ if (err < 0)
+ goto out;
+
+ /*
+ * Reading from topmost dir complete, start reading the lower
+ * dir from the beginning.
+ */
+ offset = 0;
+ path = file->f_path;
+ path_get(&path);
+ if (!follow_union_down(&path.mnt, &path.dentry))
+ goto out_pathput;
+ rdstate->nr_dirents = 0;
+ } else {
+ path = rdstate->cur_path;
+ path_get(&path);
+ }
+
+ do {
+ struct file *ftmp;
+
+ /* Get a reference for ftmp */
+ path_get(&path);
+ ftmp = dentry_open(path.dentry, path.mnt,
+ ((file->f_flags & ~(O_ACCMODE)) |
+ O_RDONLY | O_DIRECTORY | O_NOATIME));
+ if (IS_ERR(ftmp)) {
+ err = PTR_ERR(ftmp);
+ goto out_pathput;
+ }
+
+ inode = path.dentry->d_inode;
+
+ mutex_lock(&inode->i_mutex); /* TODO: use _nested version */
+ if (IS_DEADDIR(inode)) {
+ mutex_unlock(&inode->i_mutex);
+ err = -ENOENT;
+ goto out_pathput;
+ }
+
+ ftmp->f_pos = offset;
+
+ err = ftmp->f_op->readdir(ftmp, &cb, filldir_union);
+ file_accessed(ftmp);
+ rdstate->file_off = ftmp->f_pos;
+ mutex_unlock(&inode->i_mutex);
+ /* TODO: Better to unconditionally put and get ? */
+ if (path.mnt != rdstate->cur_path.mnt) {
+ mntput(rdstate->cur_path.mnt);
+ rdstate->cur_path.mnt = mntget(path.mnt);
+ }
+ if (path.dentry != rdstate->cur_path.dentry) {
+ dput(rdstate->cur_path.dentry);
+ rdstate->cur_path.dentry = dget(path.dentry);
+ }
+ fput(ftmp);
+ update_um_dirents(rdstate);
+ if (err >= 0)
+ err = cb.error;
+ if (err < 0)
+ goto out_pathput;
+
+ /*
+ * Reading from a lower dir complete, start reading the
+ * next lower dir from the beginning.
+ */
+ offset = 0;
+ rdstate->nr_dirents = 0;
+ } while (follow_union_down(&path.mnt, &path.dentry));
+
+out_pathput:
+ path_put(&path);
+out:
+ return err;
+}
+
+/*
* Union mount copyup support
*/

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -785,6 +785,33 @@ static inline int ra_has_index(struct fi
index < ra->start + ra->size);
}

+#ifdef CONFIG_UNION_MOUNT
+/* The readdir cache object */
+struct rdcache_entry {
+ struct list_head list;
+ unsigned long ino;
+ unsigned long off;
+ struct qstr name;
+ unsigned int dtype;
+};
+
+struct rdstate {
+ struct path cur_path; /* Current directory on which readdir is
+ in progress */
+ loff_t file_off; /* File offset of underlying directory */
+ loff_t cur_off; /* Offset to current dirent in rdcache */
+ loff_t last_off; /* Offset to last dirent in rdcache */
+ loff_t nr_dirents; /* Number of entries from current underlying
+ directory in rdcache */
+ struct list_head dirent_cache; /* cache of directory entries */
+};
+
+extern void put_rdstate(struct rdstate *rdstate);
+
+#else
+#define put_rdstate(x) do { } while (0)
+#endif
+
#define FILE_MNT_WRITE_TAKEN 1
#define FILE_MNT_WRITE_RELEASED 2

@@ -823,6 +850,9 @@ struct file {
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
unsigned long f_mnt_write_state;
+#ifdef CONFIG_UNION_MOUNT
+ struct rdstate *f_rdstate;
+#endif
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -36,6 +36,7 @@ struct union_mount {

struct path u_this; /* this is me */
struct path u_next; /* this is what I overlay */
+ loff_t nr_dirents; /* nr dirents in this directory */
};

#define IS_UNION(dentry) (!list_empty(&(dentry)->d_unions) || \
@@ -53,6 +54,7 @@ extern void __shrink_d_unions(struct den
extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
struct dentry *);
extern void detach_mnt_union(struct vfsmount *);
+extern int readdir_union(struct file *, void *, filldir_t);
extern int last_union_is_root(struct path *);
extern int is_dir_unioned(struct path *);
extern int union_relookup_topmost(struct nameidata *, int);
@@ -61,6 +63,8 @@ extern struct dentry *union_create_topmo
extern int __union_copyup(struct path *, struct nameidata *, struct path *);
extern int union_copyup(struct nameidata *, int);

+extern struct mutex union_rdmutex;
+
#else /* CONFIG_UNION_MOUNT */

#define IS_UNION(x) (0)
@@ -82,5 +86,29 @@ extern int union_copyup(struct nameidata

#endif /* CONFIG_UNION_MOUNT */

+static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
+{
+ int res = 0;
+ struct inode *inode = file->f_path.dentry->d_inode;
+
+ mutex_lock(&inode->i_mutex);
+#ifdef CONFIG_UNION_MOUNT
+ if (IS_MNT_UNION(file->f_path.mnt) && is_dir_unioned(&file->f_path)) {
+ mutex_lock(&union_rdmutex);
+ res = readdir_union(file, buf, filler);
+ mutex_unlock(&union_rdmutex);
+ } else
+#endif
+ {
+ res = -ENOENT;
+ if (!IS_DEADDIR(inode)) {
+ res = file->f_op->readdir(file, buf, filler);
+ file_accessed(file);
+ }
+ }
+ mutex_unlock(&inode->i_mutex);
+ return res;
+}
+
#endif /* __KERNEL__ */
#endif /* __LINUX_UNION_H */

2007-12-05 14:41:20

by Bharata B Rao

Subject: [RFC PATCH 4/5] Directory seek support

Directory seek support.

Define the seek behaviour on the stored cache of dirents.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/read_write.c | 11 ---
fs/union.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 8 ++
include/linux/union.h | 25 +++++++
4 files changed, 205 insertions(+), 10 deletions(-)

--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
#include <linux/syscalls.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
+#include <linux/union.h>
#include "read_write.h"

#include <asm/uaccess.h>
@@ -116,15 +117,7 @@ EXPORT_SYMBOL(default_llseek);

loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
{
- loff_t (*fn)(struct file *, loff_t, int);
-
- fn = no_llseek;
- if (file->f_mode & FMODE_LSEEK) {
- fn = default_llseek;
- if (file->f_op && file->f_op->llseek)
- fn = file->f_op->llseek;
- }
- return fn(file, offset, origin);
+ return do_llseek(file, offset, origin);
}
EXPORT_SYMBOL(vfs_llseek);

--- a/fs/union.c
+++ b/fs/union.c
@@ -614,6 +614,7 @@ static int rdcache_add_entry(struct rdst
this->dtype = d_type;
INIT_LIST_HEAD(&this->list);
list_add_tail(&this->list, list);
+ r->cur_dirent = this;
return 0;
}

@@ -636,18 +637,96 @@ static int filldir_union(void *buf, cons
if (rdcache_find_entry(&r->dirent_cache, name, namlen))
return 0;

- err = cb->filldir(cb->buf, name, namlen, r->cur_off,
+ /* We come here with NULL cb->filldir from lseek path */
+ if (cb->filldir)
+ err = cb->filldir(cb->buf, name, namlen, r->cur_off,
ino, d_type);
if (err >= 0) {
rdcache_add_entry(r, &r->dirent_cache,
name, namlen, offset, ino, d_type);
r->cur_off = ++r->last_off;
r->nr_dirents++;
+ if (r->cur_off == r->fill_off) {
+ /* We filled up to the required seek offset */
+ r->fill_off = 0;
+ err = -EINVAL;
+ }
}
cb->error = err;
return err;
}

+/*
+ * This is called when current offset in rdcache gets changed and when
+ * we need to change the current underlying directory in the rdstate
+ * to match the current offset.
+ */
+static void update_rdstate(struct file *file)
+{
+ struct rdstate *r = file->f_rdstate;
+ loff_t off = r->cur_off;
+ struct union_mount *um;
+
+ if (!(r->flags & RDSTATE_NEED_UPDATE))
+ return;
+
+ spin_lock(&union_lock);
+ um = union_lookup(file->f_path.dentry, file->f_path.mnt);
+ spin_unlock(&union_lock);
+ if (!um)
+ goto out;
+ off -= um->nr_dirents;
+ path_put(&r->cur_path);
+ r->cur_path = file->f_path;
+ path_get(&r->cur_path);
+
+ while (off > 0) {
+ spin_lock(&union_lock);
+ um = union_lookup(r->cur_path.dentry, r->cur_path.mnt);
+ spin_unlock(&union_lock);
+ if (!um)
+ goto out;
+ off -= um->nr_dirents;
+ path_put(&r->cur_path);
+ r->cur_path = um->u_next;
+ path_get(&r->cur_path);
+ }
+out:
+ r->file_off = r->cur_dirent->off;
+}
+
+/*
+ * Returns dirents from rdcache to userspace.
+ */
+static int readdir_rdcache(struct file *file, struct rdcache_callback *cb)
+{
+ struct rdstate *r = cb->rdstate;
+ struct rdcache_entry *tmp = r->cur_dirent;
+ int err = 0;
+
+ BUG_ON(r->cur_off > r->last_off);
+
+ /* If offsets already uptodate, just return */
+ if (likely(r->cur_off == r->last_off))
+ return 0;
+
+ /*
+ * return the entries from cur_off till last_off from rdcache to
+ * user space.
+ */
+ list_for_each_entry_from(tmp, &r->dirent_cache, list) {
+ err = cb->filldir(cb->buf, tmp->name.name, tmp->name.len,
+ r->cur_off, tmp->ino, tmp->dtype);
+ r->cur_dirent = tmp;
+ if (err < 0)
+ break;
+ r->cur_off++;
+ r->flags |= RDSTATE_NEED_UPDATE;
+ }
+ update_rdstate(file);
+ return err;
+}
+
/* Called from last fput() */
void put_rdstate(struct rdstate *rdstate)
{
@@ -710,6 +789,10 @@ int readdir_union(struct file *file, voi
cb.rdstate = rdstate;
cb.error = 0;

+ err = readdir_rdcache(file, &cb);
+ if (err)
+ return err;
+
offset = rdstate->file_off;

/* Read from the topmost directory */
@@ -796,6 +879,92 @@ out:
return err;
}

+static void rdcache_rewind(struct file *file, struct rdstate *r, loff_t offset)
+{
+ struct rdcache_entry *tmp = r->cur_dirent;
+
+ list_for_each_entry_reverse_from(tmp, &r->dirent_cache, list) {
+ if (r->cur_off == offset)
+ break;
+ r->cur_dirent = tmp;
+ r->cur_off--;
+ r->flags |= RDSTATE_NEED_UPDATE;
+ }
+ update_rdstate(file);
+}
+
+static void rdcache_forward(struct file *file, struct rdstate *r, loff_t offset)
+{
+ struct rdcache_entry *tmp = r->cur_dirent;
+
+ list_for_each_entry_continue(tmp, &r->dirent_cache, list) {
+ if (r->cur_off == offset)
+ break;
+ r->cur_dirent = tmp;
+ r->cur_off++;
+ r->flags |= RDSTATE_NEED_UPDATE;
+ }
+ update_rdstate(file);
+}
+
+loff_t llseek_union(struct file *file, loff_t offset, int origin)
+{
+ loff_t err;
+ struct rdstate *r = file->f_rdstate;
+ loff_t orig_off = file->f_rdstate->cur_off;
+
+ if (!r)
+ return -EINVAL;
+
+ switch (origin) {
+ case SEEK_END:
+ /*
+ * To support this, we need to know the end of the directory,
+ * which may need reading _all_ the entries from the filesystem
+ * to readdir cache, which can be expensive.
+ *
+ * After we have the last entry in the cache, do
+ * offset += r->last_off;
+ */
+ err = -EINVAL;
+ goto out;
+ case SEEK_CUR:
+ offset += r->cur_off;
+ break;
+ }
+ err = -EINVAL;
+ if (offset < 0)
+ goto out;
+
+ if (offset > r->cur_off) {
+ /* Seek forward into the cache */
+ rdcache_forward(file, r, offset);
+ if (offset > r->cur_off) {
+ /*
+ * Need to seek beyond the cache, so read dirents from the
+ * underlying directories into the rdcache.
+ */
+ r->fill_off = offset;
+ err = readdir_union(file, NULL, NULL);
+ if (err < 0)
+ goto out;
+
+ /* readdir() failed, restore the original offset. */
+ if (offset != r->cur_off) {
+ rdcache_rewind(file, r, orig_off);
+ err = -EINVAL;
+ goto out;
+ }
+ err = offset;
+ }
+ } else {
+ rdcache_rewind(file, r, offset);
+ err = offset;
+ }
+out:
+ return err;
+}
+
/*
* Union mount copyup support
*/
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -803,9 +803,17 @@ struct rdstate {
loff_t last_off; /* Offset to last dirent in rdcache */
loff_t nr_dirents; /* Number of entries from current underlying
directory in rdcache */
+ loff_t fill_off; /* Fill cache upto this offset. Used during
+ seek */
struct list_head dirent_cache; /* cache of directory entries */
+ struct rdcache_entry *cur_dirent; /* pointer to current directory
+ entry in rdcache which corresponds to cur_off */
+ int flags;
};

+#define RDSTATE_STALE 0x01
+#define RDSTATE_NEED_UPDATE 0x02
+
extern void put_rdstate(struct rdstate *rdstate);

#else
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -55,6 +55,7 @@ extern int attach_mnt_union(struct vfsmo
struct dentry *);
extern void detach_mnt_union(struct vfsmount *);
extern int readdir_union(struct file *, void *, filldir_t);
+extern loff_t llseek_union(struct file *, loff_t, int);
extern int last_union_is_root(struct path *);
extern int is_dir_unioned(struct path *);
extern int union_relookup_topmost(struct nameidata *, int);
@@ -110,5 +111,29 @@ static inline int do_readdir(struct file
return res;
}

+static inline loff_t do_llseek(struct file *file, loff_t offset, int origin)
+{
+ long long res;
+#ifdef CONFIG_UNION_MOUNT
+ if (IS_MNT_UNION(file->f_path.mnt) && is_dir_unioned(&file->f_path)) {
+ mutex_lock(&union_rdmutex);
+ res = llseek_union(file, offset, origin);
+ mutex_unlock(&union_rdmutex);
+ } else
+#endif
+ {
+ loff_t (*fn)(struct file *, loff_t, int);
+
+ fn = no_llseek;
+ if (file->f_mode & FMODE_LSEEK) {
+ fn = default_llseek;
+ if (file->f_op && file->f_op->llseek)
+ fn = file->f_op->llseek;
+ }
+ res = fn(file, offset, origin);
+ }
+ return res;
+}
+
#endif /* __KERNEL__ */
#endif /* __LINUX_UNION_H */

2007-12-05 14:41:58

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 5/5] Directory cache invalidation

Changes to keep the dirent cache up to date.

The dirent cache stored as part of the topmost directory's struct file needs
to be marked stale whenever there is a modification in any of the directories
that are part of the union. Modifications (like addition/deletion of entries)
to a directory can occur from places like mkdir, rmdir, mknod etc.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/dcache.c | 1
fs/namei.c | 13 +++
fs/union.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/dcache.h | 4 -
include/linux/fs.h | 4 +
include/linux/union.h | 3
6 files changed, 201 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -974,6 +974,7 @@ struct dentry *d_alloc(struct dentry * p
#endif
#ifdef CONFIG_UNION_MOUNT
INIT_LIST_HEAD(&dentry->d_unions);
+ INIT_LIST_HEAD(&dentry->d_overlaid);
dentry->d_unionized = 0;
#endif
INIT_HLIST_NODE(&dentry->d_hash);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2237,6 +2237,7 @@ static int __open_namei_create(struct na
nd->path.dentry = path->dentry;
if (error)
return error;
+ rdcache_invalidate(&nd->path);
/* Don't check for write permission, don't truncate */
return may_open(nd, 0, flag & ~O_TRUNC);
}
@@ -2682,6 +2683,8 @@ asmlinkage long sys_mknodat(int dfd, con
mode, 0);
break;
}
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
out_dput:
path_put_conditional(&path, &nd);
@@ -2757,6 +2760,8 @@ asmlinkage long sys_mkdirat(int dfd, con
if (error)
goto out_dput;
error = vfs_mkdir(nd.path.dentry->d_inode, path.dentry, mode);
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
out_dput:
path_put_conditional(&path, &nd);
@@ -3287,6 +3292,8 @@ static long do_rmdir(int dfd, const char
if (error)
goto exit3;
error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
exit3:
path_put_conditional(&path, &nd);
@@ -3375,6 +3382,8 @@ static long do_unlinkat(int dfd, const c
if (error)
goto exit2;
error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
exit2:
path_put_conditional(&path, &nd);
@@ -3466,6 +3475,8 @@ asmlinkage long sys_symlinkat(const char
goto out_dput;
error = vfs_symlink(nd.path.dentry->d_inode, path.dentry, from,
S_IALLUGO);
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
out_dput:
path_put_conditional(&path, &nd);
@@ -3566,6 +3577,8 @@ asmlinkage long sys_linkat(int olddfd, c
goto out_dput;
error = vfs_link(old_nd.path.dentry, nd.path.dentry->d_inode,
path.dentry);
+ if (!error)
+ rdcache_invalidate(&nd.path);
mnt_drop_write(nd.path.mnt);
out_dput:
path_put_conditional(&path, &nd);
--- a/fs/union.c
+++ b/fs/union.c
@@ -39,6 +39,7 @@ static struct hlist_head *union_hashtabl
static unsigned int union_rhash_mask __read_mostly;
static unsigned int union_rhash_shift __read_mostly;
static struct hlist_head *union_rhashtable __read_mostly;
+static struct hlist_head *readdir_hashtable __read_mostly;

/*
* Locking Rules:
@@ -103,6 +104,18 @@ static int __init init_union(void)
for (loop = 0; loop < (1 << union_rhash_shift); loop++)
INIT_HLIST_HEAD(&union_rhashtable[loop]);

+ readdir_hashtable = alloc_large_system_hash("readdir-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_rhash_shift,
+ &union_rhash_mask,
+ 0);
+
+ for (loop = 0; loop < (1 << union_rhash_shift); loop++)
+ INIT_HLIST_HEAD(&readdir_hashtable[loop]);
+
readdir_cache = kmem_cache_create("readdir-cache",
sizeof(struct rdcache_entry), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
@@ -126,6 +139,7 @@ struct union_mount *union_alloc(struct d
atomic_set(&um->u_count, 1);
INIT_LIST_HEAD(&um->u_unions);
INIT_LIST_HEAD(&um->u_list);
+ INIT_LIST_HEAD(&um->u_overlaid);
INIT_HLIST_NODE(&um->u_hash);
INIT_HLIST_NODE(&um->u_rhash);

@@ -290,6 +304,7 @@ int append_to_union(struct vfsmount *mnt
}
list_add(&this->u_list, &mnt->mnt_unions);
list_add(&this->u_unions, &dentry->d_unions);
+ list_add(&this->u_overlaid, &dest_dentry->d_overlaid);
dest_dentry->d_unionized++;
__union_hash(this);
spin_unlock(&union_lock);
@@ -367,6 +382,22 @@ int follow_union_mount(struct vfsmount *
return res;
}

+int follow_union_up(struct vfsmount **mnt, struct dentry **dentry)
+{
+ struct union_mount *um;
+
+ if (!IS_UNION(*dentry))
+ return 0;
+
+ um = union_rlookup(*dentry, *mnt);
+ if (um) {
+ *dentry = um->u_this.dentry;
+ *mnt = um->u_this.mnt;
+ return 1;
+ }
+ return 0;
+}
+
/*
* is_dir_unioned - check if the directory represented by @path has a union
* stack underneath.
@@ -418,6 +449,7 @@ repeat:
list_del(&this->u_list);
list_del(&this->u_unions);
this->u_next.dentry->d_unionized--;
+ list_del(&this->u_overlaid);
spin_unlock(&union_lock);
union_put(this);
goto repeat;
@@ -447,6 +479,7 @@ repeat:
list_del(&this->u_list);
list_del(&this->u_unions);
this->u_next.dentry->d_unionized--;
+ list_del(&this->u_overlaid);
spin_unlock(&union_lock);
if (__union_put(this)) {
__dput(n_dentry, list);
@@ -474,6 +507,7 @@ repeat:
list_del(&this->u_list);
list_del(&this->u_unions);
this->u_next.dentry->d_unionized--;
+ list_del(&this->u_overlaid);
spin_unlock(&union_lock);
union_put(this);
goto repeat;
@@ -505,6 +539,7 @@ void detach_mnt_union(struct vfsmount *m
list_del(&um->u_list);
list_del(&um->u_unions);
um->u_next.dentry->d_unionized--;
+ list_del(&um->u_overlaid);
spin_unlock(&union_lock);
union_put(um);
return;
@@ -657,6 +692,35 @@ static int filldir_union(void *buf, cons
}

/*
+ * Called when rdcache becomes stale due to modifications to the directory.
+ * Discard the existing rdcache, reread the entries and rebuild the cache
+ * till the current f_pos.
+ * rdcache is marked stale from places like mkdir(), creat() etc.
+ *
+ * CHECK: What if the current f_pos can't be established after the change?
+ */
+static int rebuild_rdcache(struct file *file)
+{
+ struct rdstate *r = file->f_rdstate;
+
+ rdcache_free(&r->dirent_cache);
+
+ path_put(&r->cur_path);
+ r->cur_path = file->f_path;
+ path_get(&r->cur_path);
+
+ /* Fill rdcache until the previously known offset */
+ r->fill_off = r->cur_off - 1;
+
+ r->cur_off = r->last_off = 0;
+ r->cur_dirent = NULL;
+ r->file_off = 0;
+ r->nr_dirents = 0;
+
+ return readdir_union(file, NULL, NULL);
+}
+
+/*
* This is called when current offset in rdcache gets changed and when
* we need to change the current underlying directory in the rdstate
* to match the current offset.
@@ -704,6 +768,13 @@ static int readdir_rdcache(struct file *
struct rdcache_entry *tmp = r->cur_dirent;
int err = 0;

+ if (unlikely(r->flags & RDSTATE_STALE)) {
+ r->flags &= ~RDSTATE_STALE;
+ err = rebuild_rdcache(file);
+ if (err)
+ return err;
+ }
+
BUG_ON(r->cur_off > r->last_off);

/* If offsets already uptodate, just return */
@@ -734,6 +805,11 @@ void put_rdstate(struct rdstate *rdstate
return;

mutex_lock(&union_rdmutex);
+ spin_lock(&union_lock);
+ hlist_del(&rdstate->hash);
+ spin_unlock(&union_lock);
+
+ path_put(&rdstate->path);
path_put(&rdstate->cur_path);
rdcache_free(&rdstate->dirent_cache);
mutex_unlock(&union_rdmutex);
@@ -758,10 +834,16 @@ static struct rdstate *get_rdstate(struc
if (!r)
return ERR_PTR(-ENOMEM);

- r->cur_path = file->f_path;
+ r->path = r->cur_path = file->f_path;
+ path_get(&r->path);
path_get(&r->cur_path);
INIT_LIST_HEAD(&r->dirent_cache);
file->f_rdstate = r;
+ INIT_HLIST_NODE(&r->hash);
+ spin_lock(&union_lock);
+ hlist_add_head(&r->hash, readdir_hashtable +
+ hash(r->path.dentry, r->path.mnt));
+ spin_unlock(&union_lock);
return r;
}

@@ -916,6 +998,13 @@ loff_t llseek_union(struct file *file, l
if (!r)
return -EINVAL;

+ if (unlikely(r->flags & RDSTATE_STALE)) {
+ r->flags &= ~RDSTATE_STALE;
+ err = rebuild_rdcache(file);
+ if (err)
+ goto out;
+ }
+
switch (origin) {
case SEEK_END:
/*
@@ -965,6 +1054,93 @@ out:
return err;
}

+static struct rdstate *lookup_rdstate(struct path *path)
+{
+ struct hlist_head *head = readdir_hashtable +
+ hash(path->dentry, path->mnt);
+ struct hlist_node *node;
+ struct rdstate *r;
+
+ hlist_for_each_entry(r, node, head, hash) {
+ if ((r->path.dentry == path->dentry) &&
+ (r->path.mnt == path->mnt))
+ return r;
+ }
+ return NULL;
+}
+
+static inline void mark_rdstate_stale(struct path *path)
+{
+ struct rdstate *r = lookup_rdstate(path);
+
+ if (r)
+ r->flags |= RDSTATE_STALE;
+ return;
+}
+
+/*
+ * Mark the rdstates associated with all the unions that @path->dentry
+ * is part of as stale.
+ */
+static void rdcache_invalidate_union(struct path *path)
+{
+ struct union_mount *um;
+
+ spin_lock(&union_lock);
+ list_for_each_entry(um, &path->dentry->d_unions, u_unions)
+ mark_rdstate_stale(&um->u_this);
+ spin_unlock(&union_lock);
+}
+
+/*
+ * Mark rdstates associated with all the unions where
+ * @path->dentry has been overlaid as stale.
+ */
+static void rdcache_invalidate_overlaid(struct path *path)
+{
+ struct union_mount *um;
+ struct path cur;
+
+ spin_lock(&dcache_lock);
+ spin_lock(&union_lock);
+ list_for_each_entry(um, &path->dentry->d_overlaid, u_overlaid) {
+ /*
+ * CHECK: We don't take a reference on the toplevel dentry of
+ * any union here, on the understanding that no dentry can go
+ * away while we hold dcache_lock.
+ */
+ cur = um->u_this;
+ mark_rdstate_stale(&cur);
+ while (follow_union_up(&cur.mnt, &cur.dentry))
+ mark_rdstate_stale(&cur);
+ }
+ spin_unlock(&union_lock);
+ spin_unlock(&dcache_lock);
+}
+
+void rdcache_invalidate(struct path *path)
+{
+ struct path save;
+
+ /* If this dentry is not part of any union, no rdstate to invalidate */
+ if (!IS_UNION(path->dentry))
+ return;
+
+ save = *path;
+ path_get(&save);
+
+ mutex_lock(&union_rdmutex);
+
+ if (!list_empty(&save.dentry->d_unions))
+ rdcache_invalidate_union(&save);
+
+ if (IS_DENTRY_OVERLAID(save.dentry))
+ rdcache_invalidate_overlaid(&save);
+
+ mutex_unlock(&union_rdmutex);
+ path_put(&save);
+}
+
/*
* Union mount copyup support
*/
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -97,9 +97,11 @@ struct dentry {
#ifdef CONFIG_UNION_MOUNT
/*
* The following fields are used by the VFS based union mount
- * implementation. Both are protected by union_lock!
+ * implementation. They are protected by union_lock!
*/
struct list_head d_unions; /* list of union_mount's */
+ struct list_head d_overlaid; /* list of union_mount's from where this
+ dentry is overlaid */
unsigned int d_unionized; /* unions referencing this dentry */
#endif

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -798,6 +798,8 @@ struct rdcache_entry {
struct rdstate {
struct path cur_path; /* Current directory on which readdir is
in progress */
+ struct hlist_node hash; /* used to link into readdir_hashtable */
+ struct path path; /* the path to which this rdstate belongs to */
loff_t file_off; /* File offset of underlying directory */
loff_t cur_off; /* Offset to current dirent in rdcache */
loff_t last_off; /* Offset to last dirent in rdcache */
@@ -815,9 +817,11 @@ struct rdstate {
#define RDSTATE_NEED_UPDATE 0x02

extern void put_rdstate(struct rdstate *rdstate);
+extern void rdcache_invalidate(struct path *path);

#else
#define put_rdstate(x) do { } while (0)
+#define rdcache_invalidate(x) do { } while (0)
#endif

#define FILE_MNT_WRITE_TAKEN 1
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -31,6 +31,7 @@ struct union_mount {
struct mutex u_mutex;
struct list_head u_unions; /* list head for d_unions */
struct list_head u_list; /* list head for mnt_unions */
+ struct list_head u_overlaid; /* list head for d_overlaid */
struct hlist_node u_hash; /* list head for searching */
struct hlist_node u_rhash; /* list head for reverse searching */

@@ -42,6 +43,7 @@ struct union_mount {
#define IS_UNION(dentry) (!list_empty(&(dentry)->d_unions) || \
(dentry)->d_unionized)
#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+#define IS_DENTRY_OVERLAID(dentry) (dentry)->d_unionized

extern int is_unionized(struct dentry *, struct vfsmount *);
extern int append_to_union(struct vfsmount *, struct dentry *,
@@ -70,6 +72,7 @@ extern struct mutex union_rdmutex;

#define IS_UNION(x) (0)
#define IS_MNT_UNION(x) (0)
+#define IS_DENTRY_OVERLAID(x) (0)
#define is_unionized(x, y) (0)
#define append_to_union(x1, y1, x2, y2, z) ({ BUG(); (0); })
#define follow_union_down(x, y) ({ (0); })

2007-12-05 15:01:53

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 3/5] Add list_for_each_entry_reverse_from()

Introduce list_for_each_entry_reverse_from() needed by a subsequent patch.

Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/list.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -562,6 +562,19 @@ static inline void list_splice_init_rcu(
pos = list_entry(pos->member.next, typeof(*pos), member))

/**
+ * list_for_each_entry_reverse_from - iterate backwards over list of given
+ * type from the current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_struct within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_reverse_from(pos, head, member) \
+ for (; prefetch(pos->member.prev), &pos->member != (head); \
+ pos = list_entry(pos->member.prev, typeof(*pos), member))
+
+/**
* list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
* @pos: the type * to use as a loop cursor.
* @n: another type * to use as temporary storage

2007-12-05 16:27:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] Remove existing directory listing implementation

On Wed, 2007-12-05 at 20:08 +0530, Bharata B Rao wrote:
> Remove the existing readdir implementation.

You may have had a better description in your 0/5 mail, but this is what
goes into the git log in the end, so I think you need to beef this up a
bit.

-- Dave

2007-12-05 17:22:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

On Wed, 2007-12-05 at 20:07 +0530, Bharata B Rao wrote:
> In this approach, the cached dirents are given offsets in the form of
> linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
> uniformly define offsets across all the directories of the union
> irrespective of the type of filesystem involved. Also this is needed to
> define a seek behaviour on the union mounted directory. This cache is stored
> as part of the struct file of the topmost directory of the union and will
> remain as long as the directory is kept open.

That's a little brute force, don't you think? Especially the memory
exhaustion seems to really make this an undesirable approach.

I think the key here is what kind of consistency we're trying to
provide. If a directory is being changed underneath a reader, what
kinds of guarantees do they get about the contents of their directory
read? When do those guarantees start? Are there any at open() time?

Rather than give each _dirent_ an offset, could we give each sub-mount
an offset? Let's say we have three members comprising a union mount
directory. The first has 100 dirents, the second 200, and the third
10,000. When the first readdir is done, we populate the table like
this:

mount_offset[0] = 0;
mount_offset[1] = 100;
mount_offset[2] = 300;

If someone seeks back to 150, then we subtract mount[1]'s offset
(100), and realize that we want the 50th dirent from mount[1].

I don't know whether we're bound to this:

http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html

"If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified."

But that would seem to tell me that once you populate a table such as
the one I've described and create it at open(dir) time, you don't
actually ever need to update it.

The storage for this is only proportional to the number of mounts that you
have. One issue arises if we manage to overflow our data types with too
many entries in too many submounts. But, I guess we can just truncate
the directory in that case.

-- Dave

2007-12-06 10:01:35

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

On Wed, Dec 05, Dave Hansen wrote:

> I think the key here is what kind of consistency we're trying to
> provide. If a directory is being changed underneath a reader, what
> kinds of guarantees do they get about the contents of their directory
> read? When do those guarantees start? Are there any at open() time?

But we still want to be compliant with what POSIX defines. The problem isn't the
consistency of the readdir result but the seekdir/telldir interface. IMHO that
interface is totally broken: you need to be able to find every offset given by
telldir since the last open. The problem is that seekdir isn't able to return
errors. Otherwise you could just forbid seeking on union directories.

> Rather than give each _dirent_ an offset, could we give each sub-mount
> an offset? Let's say we have three members comprising a union mount
> directory. The first has 100 dirents, the second 200, and the third
> 10,000. When the first readdir is done, we populate the table like
> this:
>
> mount_offset[0] = 0;
> mount_offset[1] = 100;
> mount_offset[2] = 300;
>
> If someone seeks back to 150, then we subtract mount[1]'s offset
> (100), and realize that we want the 50th dirent from mount[1].

Yes, that is a nice idea and it is exactly what I have implemented in my patch
series. But you forgot one thing: directories are not flat files. The dentry
offset in a directory is a random cookie. Therefore it is not possible to have
a linear mapping without allocating memory.

> I don't know whether we're bound to this:
>
> http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
>
> "If a file is removed from or added to the directory after the
> most recent call to opendir() or rewinddir(), whether a
> subsequent call to readdir() returns an entry for that file is
> unspecified."
>
> But that would seem to tell me that once you populate a table such as
> the one I've described and create it at open(dir) time, you don't
> actually ever need to update it.

Yes, I'm using such a patch on our S390 buildservers to work around some
readdir/seek/rm problem with old glibc versions. It seems to work, but on the
other hand these are really huge systems and I haven't run out of memory while
doing a readdir yet ;)

The proper way to implement this would be to cache the offsets on a per-inode
basis. Otherwise the user could easily DoS this by opening a number of
directories and never close them.

Regards,
Jan

--
Jan Blunck <[email protected]>

2007-12-06 15:11:47

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

On Thu, Dec 06, 2007 at 11:01:18AM +0100, Jan Blunck wrote:
> On Wed, Dec 05, Dave Hansen wrote:
>
> > I think the key here is what kind of consistency we're trying to
> > provide. If a directory is being changed underneath a reader, what
> > kinds of guarantees do they get about the contents of their directory
> > read? When do those guarantees start? Are there any at open() time?
>
> But we still want to be compliant to what POSIX defines. The problem isn't the
> consistency of the readdir result but the seekdir/telldir interface. IMHO that
> interface is totally broken: you need to be able to find every offset given by
> telldir since the last open. The problem is that seekdir isn't able to return
> errors. Otherwise you could just forbid seeking on union directories.

Also, what kind of consistency is expected when a directory is open(2)ed,
and readdir(2) and lseek(2) are applied to it while the directory gets
changed underneath the reader? From this:
http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
the behaviour/guarantees weren't apparent to me.

>
> > Rather than give each _dirent_ an offset, could we give each sub-mount
> > an offset? Let's say we have three members comprising a union mount
> > directory. The first has 100 dirents, the second 200, and the third
> > 10,000. When the first readdir is done, we populate the table like
> > this:
> >
> > mount_offset[0] = 0;
> > mount_offset[1] = 100;
> > mount_offset[2] = 300;
> >
> > If someone seeks back to 150, then we subtract mount[1]'s offset
> > (100), and realize that we want the 50th dirent from mount[1].
>
> Yes, that is a nice idea and it is exactly what I have implemented in my patch
> series. But you forgot one thing: directories are not flat files. The dentry
> offset in a directory is a random cookie. Therefore it is not possible to have
> a linear mapping without allocating memory.

And I defined this linear behaviour on the cache of dirents we maintain
in the approach I posted. The main reason we maintain the cache of
dirents in memory is for duplicate elimination.

>
> > I don't know whether we're bound to this:
> >
> > http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
> >
> > "If a file is removed from or added to the directory after the
> > most recent call to opendir() or rewinddir(), whether a
> > subsequent call to readdir() returns an entry for that file is
> > unspecified."
> >
> > But that would seem to tell me that once you populate a table such as
> > the one I've described and create it at open(dir) time, you don't
> > actually ever need to update it.
>
> Yes, I'm using such a patch on our S390 buildservers to work around some
> readdir/seek/rm problem with old glibc versions. It seems to work but on the
> other hand this are really huge systems and I haven't run out of memory while
> doing a readdir yet ;)
>
> The proper way to implement this would be to cache the offsets on a per inode
> base. Otherwise the user could easily DoS this by opening a number of
> directories and never close them.
>

You mean cache the offsets or the dirents? How would that solve
the seek problem? How would it enable you to define a seek behaviour
for the entire union of directories?

Regards,
Bharata.

2007-12-06 17:55:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

On Thu, 2007-12-06 at 11:01 +0100, Jan Blunck wrote:
> > Rather than give each _dirent_ an offset, could we give each sub-mount
> > an offset? Let's say we have three members comprising a union mount
> > directory. The first has 100 dirents, the second 200, and the third
> > 10,000. When the first readdir is done, we populate the table like
> > this:
> >
> > mount_offset[0] = 0;
> > mount_offset[1] = 100;
> > mount_offset[2] = 300;
> >
> > If someone seeks back to 150, then we subtract mount[1]'s offset
> > (100), and realize that we want the 50th dirent from mount[1].
>
> Yes, that is a nice idea and it is exactly what I have implemented in my patch
> series. But you forgot one thing: directories are not flat files. The dentry
> offset in a directory is a random cookie. Therefore it is not possible to have
> a linear mapping without allocating memory.

Is it truly a random cookie where the fs gets to put anything in there
that it can fit in a long? Isn't that an advantage? Being a random
cookie, we can encode anything we want in there. We could encode the
"mount index" (or whatever you want to call it) in high or low bits and
have the rest store the fs-specific offset. The only problem there
would be running out of storage space in a long.

But, what do people expect when they have huge directories (or huge
directory offsets)? Surely, if their fs is already pressing the fs's
directory data types to the limits, they can't simply union mount a
couple of those directories together and expect magic. Union mounts
also can't be expected to compress these directory positions to fit.

So, I think it's reasonable behavior to readdir() until the sum of the
highest position seen on each mount would overflow the off_t that we
have to store it in.

-- Dave

2007-12-07 02:14:44

by sfjro

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support


Bharata B Rao:
> - The cache can grow arbitrarily large in size for big directories thereby
> consuming lots of memory. Pruning individual cache entries is out of question
> as entire cache is needed for subsequent readdirs for duplicate elimination.

Additionally, memory usage may be a problem too, since your
implementation calls kmalloc() for every name.


> - Whenever _any_ directory that is part of the union gets
> modified (addition/deletion of entries), the dirent cache of all the unions
> which this directory is part of, needs to be purged and rebuilt. This is
> expensive not only due to re-reads of dirents but also because
> readdir(2)/getdents(2) needs to be synchronized with other operations
> like mkdir/mknod/link/unlink etc.

The cache in struct file doesn't need to be refreshed unless rewinddir()
is issued. Also you can maintain the cache on every entry addition/deletion,
instead of discarding the cache entirely.


> After all this, I am beginning to think if it would be better to delegate
> this readdir and whiteout processing to userspace. Can this be better handled

Yes, I had such an idea once. And copy-up too. They can be done in
userspace (though you need to be careful about privileges).

Anyway I agree with you. As I wrote before, this approach consumes a lot
of memory and cpu (for comparing whiteouted names).


Junjiro Okajima