2010-11-09 12:46:20

by Nick Piggin

Subject: [patch 1/6] fs: icache RCU free inodes

So here is the inode RCU code. It's obviously not worth doing until the
actual rcu-walk path walking is in, but I'd like to get opinions on it.
It would be nice to merge it into Al's tree at some point, though.
--

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- Icache locking to use RCU.
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to the
inability to reuse cache-hot slab objects. As iterations increase and RCU
freeing starts kicking in, this grows to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller. These
cases are much more realistic in general (except perhaps sockets, which
are covered by the next patch).

The cache-hot regression can largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds complexity to list walking and store-free path walking, so
I prefer to implement this at a later date, if it is shown to be a win in real
situations. I haven't found a regression in any non-micro benchmark so I
don't think it will be a huge problem.
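
As a rough illustration (not part of the patch; the function name is made
up), the access pattern rcu-walk relies on is dereferencing d_inode for a
permission check under rcu_read_lock() alone, with no reference held.
Plain call_rcu() freeing keeps that simple: the memory cannot be reused
until all current RCU readers are done. With SLAB_DESTROY_BY_RCU the slab
object may already have been recycled as a different inode, so every such
access would also have to re-validate identity under a lock, which is the
extra complexity mentioned above.

/*
 * Illustrative sketch only: a grossly simplified rcu-walk permission
 * check.  The caller holds rcu_read_lock() for the whole walk and takes
 * no d_count or i_count references.
 */
static int may_lookup_rcu(struct dentry *dentry)
{
	struct inode *inode = dentry->d_inode;

	if (!inode)
		return -ENOENT;
	/*
	 * Only safe because inodes are freed via call_rcu(): the memory
	 * still holds this inode until all current RCU readers finish.
	 */
	if (!(inode->i_mode & S_IXUGO))
		return -EACCES;
	return 0;
}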

Signed-off-by: Nick Piggin <[email protected]>

---
Documentation/filesystems/porting | 12 ++++++++++++
arch/powerpc/platforms/cell/spufs/inode.c | 10 ++++++++--
drivers/staging/pohmelfs/inode.c | 11 +++++++++--
fs/9p/vfs_inode.c | 9 ++++++++-
fs/adfs/super.c | 9 ++++++++-
fs/affs/super.c | 9 ++++++++-
fs/afs/super.c | 10 +++++++++-
fs/befs/linuxvfs.c | 10 ++++++++--
fs/bfs/inode.c | 9 ++++++++-
fs/block_dev.c | 9 ++++++++-
fs/btrfs/inode.c | 9 ++++++++-
fs/ceph/inode.c | 11 ++++++++++-
fs/cifs/cifsfs.c | 9 ++++++++-
fs/coda/inode.c | 9 ++++++++-
fs/ecryptfs/super.c | 12 +++++++++++-
fs/efs/super.c | 9 ++++++++-
fs/exofs/super.c | 9 ++++++++-
fs/ext2/super.c | 9 ++++++++-
fs/ext3/super.c | 9 ++++++++-
fs/ext4/super.c | 9 ++++++++-
fs/fat/inode.c | 9 ++++++++-
fs/freevxfs/vxfs_inode.c | 9 ++++++++-
fs/fuse/inode.c | 9 ++++++++-
fs/gfs2/super.c | 9 ++++++++-
fs/hfs/super.c | 9 ++++++++-
fs/hfsplus/super.c | 10 +++++++++-
fs/hostfs/hostfs_kern.c | 9 ++++++++-
fs/hpfs/super.c | 9 ++++++++-
fs/hppfs/hppfs.c | 9 ++++++++-
fs/hugetlbfs/inode.c | 9 ++++++++-
fs/inode.c | 10 +++++++++-
fs/isofs/inode.c | 9 ++++++++-
fs/jffs2/super.c | 9 ++++++++-
fs/jfs/super.c | 10 +++++++++-
fs/logfs/inode.c | 9 ++++++++-
fs/minix/inode.c | 9 ++++++++-
fs/ncpfs/inode.c | 9 ++++++++-
fs/nfs/inode.c | 9 ++++++++-
fs/nilfs2/super.c | 10 +++++++++-
fs/ntfs/inode.c | 9 ++++++++-
fs/ocfs2/dlmfs/dlmfs.c | 9 ++++++++-
fs/ocfs2/super.c | 9 ++++++++-
fs/openpromfs/inode.c | 9 ++++++++-
fs/proc/inode.c | 9 ++++++++-
fs/qnx4/inode.c | 9 ++++++++-
fs/reiserfs/super.c | 9 ++++++++-
fs/romfs/super.c | 9 ++++++++-
fs/squashfs/super.c | 9 ++++++++-
fs/sysv/inode.c | 9 ++++++++-
fs/ubifs/super.c | 10 +++++++++-
fs/udf/super.c | 9 ++++++++-
fs/ufs/super.c | 9 ++++++++-
fs/xfs/xfs_iget.c | 13 ++++++++++++-
include/linux/fs.h | 5 ++++-
include/linux/net.h | 1 -
ipc/mqueue.c | 9 ++++++++-
mm/shmem.c | 9 ++++++++-
net/socket.c | 16 ++++++++--------
net/sunrpc/rpc_pipe.c | 10 +++++++++-
59 files changed, 481 insertions(+), 68 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ext2/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -161,11 +161,18 @@ static struct inode *ext2_alloc_inode(st
return &ei->vfs_inode;
}

-static void ext2_destroy_inode(struct inode *inode)
+static void ext2_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
}

+static void ext2_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, ext2_i_callback);
+}
+
static void init_once(void *foo)
{
struct ext2_inode_info *ei = (struct ext2_inode_info *) foo;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/inode.c 2010-11-09 23:22:54.000000000 +1100
@@ -270,6 +270,13 @@ void __destroy_inode(struct inode *inode
}
EXPORT_SYMBOL(__destroy_inode);

+static void i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(inode_cachep, inode);
+}
+
static void destroy_inode(struct inode *inode)
{
BUG_ON(!list_empty(&inode->i_lru));
@@ -277,7 +284,7 @@ static void destroy_inode(struct inode *
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
- kmem_cache_free(inode_cachep, (inode));
+ call_rcu(&inode->i_rcu, i_callback);
}

/*
@@ -430,6 +437,7 @@ void end_writeback(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
inode_sync_wait(inode);
+ /* don't need i_lock here, no concurrent mods to i_state */
inode->i_state = I_FREEING | I_CLEAR;
}
EXPORT_SYMBOL(end_writeback);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/include/linux/fs.h 2010-11-09 23:22:54.000000000 +1100
@@ -736,7 +736,10 @@ struct inode {
struct list_head i_wb_list; /* backing dev IO list */
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
- struct list_head i_dentry;
+ union {
+ struct list_head i_dentry;
+ struct rcu_head i_rcu;
+ };
unsigned long i_ino;
atomic_t i_count;
unsigned int i_nlink;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/block_dev.c 2010-11-09 22:11:10.000000000 +1100
@@ -410,13 +410,20 @@ static struct inode *bdev_alloc_inode(st
return &ei->vfs_inode;
}

-static void bdev_destroy_inode(struct inode *inode)
+static void bdev_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
struct bdev_inode *bdi = BDEV_I(inode);

+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(bdev_cachep, bdi);
}

+static void bdev_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, bdev_i_callback);
+}
+
static void init_once(void *foo)
{
struct bdev_inode *ei = (struct bdev_inode *) foo;
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ext3/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -480,6 +480,13 @@ static struct inode *ext3_alloc_inode(st
return &ei->vfs_inode;
}

+static void ext3_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+}
+
static void ext3_destroy_inode(struct inode *inode)
{
if (!list_empty(&(EXT3_I(inode)->i_orphan))) {
@@ -490,7 +497,7 @@ static void ext3_destroy_inode(struct in
false);
dump_stack();
}
- kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+ call_rcu(&inode->i_rcu, ext3_i_callback);
}

static void init_once(void *foo)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/hugetlbfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -663,11 +663,18 @@ static struct inode *hugetlbfs_alloc_ino
return &p->vfs_inode;
}

+static void hugetlbfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
- kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+ call_rcu(&inode->i_rcu, hugetlbfs_i_callback);
}

static const struct address_space_operations hugetlbfs_aops = {
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/proc/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -66,11 +66,18 @@ static struct inode *proc_alloc_inode(st
return inode;
}

-static void proc_destroy_inode(struct inode *inode)
+static void proc_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(proc_inode_cachep, PROC_I(inode));
}

+static void proc_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, proc_i_callback);
+}
+
static void init_once(void *foo)
{
struct proc_inode *ei = (struct proc_inode *) foo;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/ipc/mqueue.c 2010-11-09 22:11:10.000000000 +1100
@@ -237,11 +237,18 @@ static struct inode *mqueue_alloc_inode(
return &ei->vfs_inode;
}

-static void mqueue_destroy_inode(struct inode *inode)
+static void mqueue_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
}

+static void mqueue_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, mqueue_i_callback);
+}
+
static void mqueue_evict_inode(struct inode *inode)
{
struct mqueue_inode_info *info;
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2010-11-09 22:11:03.000000000 +1100
+++ linux-2.6/net/socket.c 2010-11-09 23:22:54.000000000 +1100
@@ -262,20 +262,20 @@ static struct inode *sock_alloc_inode(st
}


-static void wq_free_rcu(struct rcu_head *head)
+static void sock_free_rcu(struct rcu_head *head)
{
- struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct socket_alloc *ei = container_of(inode, struct socket_alloc,
+ vfs_inode);

- kfree(wq);
+ kfree(ei->socket.wq);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(sock_inode_cachep, ei);
}

static void sock_destroy_inode(struct inode *inode)
{
- struct socket_alloc *ei;
-
- ei = container_of(inode, struct socket_alloc, vfs_inode);
- call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
- kmem_cache_free(sock_inode_cachep, ei);
+ call_rcu(&inode->i_rcu, sock_free_rcu);
}

static void init_once(void *foo)
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/fat/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -514,11 +514,18 @@ static struct inode *fat_alloc_inode(str
return &ei->vfs_inode;
}

-static void fat_destroy_inode(struct inode *inode)
+static void fat_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(fat_inode_cachep, MSDOS_I(inode));
}

+static void fat_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, fat_i_callback);
+}
+
static void init_once(void *foo)
{
struct msdos_inode_info *ei = (struct msdos_inode_info *)foo;
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/nfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -1437,11 +1437,18 @@ struct inode *nfs_alloc_inode(struct sup
return &nfsi->vfs_inode;
}

-void nfs_destroy_inode(struct inode *inode)
+static void nfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
}

+void nfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, nfs_i_callback);
+}
+
static inline void nfs4_init_once(struct nfs_inode *nfsi)
{
#ifdef CONFIG_NFS_V4
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/mm/shmem.c 2010-11-09 22:11:10.000000000 +1100
@@ -2415,13 +2415,20 @@ static struct inode *shmem_alloc_inode(s
return &p->vfs_inode;
}

+static void shmem_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+}
+
static void shmem_destroy_inode(struct inode *inode)
{
if ((inode->i_mode & S_IFMT) == S_IFREG) {
/* only struct inode is valid if it's an inline symlink */
mpol_free_shared_policy(&SHMEM_I(inode)->policy);
}
- kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+ call_rcu(&inode->i_rcu, shmem_i_callback);
}

static void init_once(void *foo)
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c 2010-11-09 22:11:03.000000000 +1100
+++ linux-2.6/net/sunrpc/rpc_pipe.c 2010-11-09 23:22:53.000000000 +1100
@@ -162,11 +162,19 @@ rpc_alloc_inode(struct super_block *sb)
}

static void
-rpc_destroy_inode(struct inode *inode)
+rpc_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(rpc_inode_cachep, RPC_I(inode));
}

+static void
+rpc_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, rpc_i_callback);
+}
+
static int
rpc_pipe_open(struct inode *inode, struct file *filp)
{
Index: linux-2.6/include/linux/net.h
===================================================================
--- linux-2.6.orig/include/linux/net.h 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/include/linux/net.h 2010-11-09 23:22:54.000000000 +1100
@@ -120,7 +120,6 @@ enum sock_shutdown_cmd {
struct socket_wq {
wait_queue_head_t wait;
struct fasync_struct *fasync_list;
- struct rcu_head rcu;
} ____cacheline_aligned_in_smp;

/**
Index: linux-2.6/fs/9p/vfs_inode.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/9p/vfs_inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -237,10 +237,17 @@ struct inode *v9fs_alloc_inode(struct su
*
*/

-void v9fs_destroy_inode(struct inode *inode)
+static void v9fs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(vcookie_cache, v9fs_inode2cookie(inode));
}
+
+void v9fs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, v9fs_i_callback);
+}
#endif

/**
Index: linux-2.6/fs/adfs/super.c
===================================================================
--- linux-2.6.orig/fs/adfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/adfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -240,11 +240,18 @@ static struct inode *adfs_alloc_inode(st
return &ei->vfs_inode;
}

-static void adfs_destroy_inode(struct inode *inode)
+static void adfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(adfs_inode_cachep, ADFS_I(inode));
}

+static void adfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, adfs_i_callback);
+}
+
static void init_once(void *foo)
{
struct adfs_inode_info *ei = (struct adfs_inode_info *) foo;
Index: linux-2.6/fs/affs/super.c
===================================================================
--- linux-2.6.orig/fs/affs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/affs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -95,11 +95,18 @@ static struct inode *affs_alloc_inode(st
return &i->vfs_inode;
}

-static void affs_destroy_inode(struct inode *inode)
+static void affs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(affs_inode_cachep, AFFS_I(inode));
}

+static void affs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, affs_i_callback);
+}
+
static void init_once(void *foo)
{
struct affs_inode_info *ei = (struct affs_inode_info *) foo;
Index: linux-2.6/fs/afs/super.c
===================================================================
--- linux-2.6.orig/fs/afs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/afs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -498,6 +498,14 @@ static struct inode *afs_alloc_inode(str
return &vnode->vfs_inode;
}

+static void afs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct afs_vnode *vnode = AFS_FS_I(inode);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(afs_inode_cachep, vnode);
+}
+
/*
* destroy an AFS inode struct
*/
@@ -511,7 +519,7 @@ static void afs_destroy_inode(struct ino

ASSERTCMP(vnode->server, ==, NULL);

- kmem_cache_free(afs_inode_cachep, vnode);
+ call_rcu(&inode->i_rcu, afs_i_callback);
atomic_dec(&afs_count_active_inodes);
}

Index: linux-2.6/fs/befs/linuxvfs.c
===================================================================
--- linux-2.6.orig/fs/befs/linuxvfs.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/befs/linuxvfs.c 2010-11-09 22:11:10.000000000 +1100
@@ -284,12 +284,18 @@ befs_alloc_inode(struct super_block *sb)
return &bi->vfs_inode;
}

-static void
-befs_destroy_inode(struct inode *inode)
+static void befs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(befs_inode_cachep, BEFS_I(inode));
}

+static void befs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, befs_i_callback);
+}
+
static void init_once(void *foo)
{
struct befs_inode_info *bi = (struct befs_inode_info *) foo;
Index: linux-2.6/fs/bfs/inode.c
===================================================================
--- linux-2.6.orig/fs/bfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/bfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -248,11 +248,18 @@ static struct inode *bfs_alloc_inode(str
return &bi->vfs_inode;
}

-static void bfs_destroy_inode(struct inode *inode)
+static void bfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(bfs_inode_cachep, BFS_I(inode));
}

+static void bfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, bfs_i_callback);
+}
+
static void init_once(void *foo)
{
struct bfs_inode_info *bi = foo;
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c 2010-11-09 23:22:53.000000000 +1100
@@ -6322,6 +6322,13 @@ struct inode *btrfs_alloc_inode(struct s
return inode;
}

+static void btrfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
void btrfs_destroy_inode(struct inode *inode)
{
struct btrfs_ordered_extent *ordered;
@@ -6391,7 +6398,7 @@ void btrfs_destroy_inode(struct inode *i
inode_tree_del(inode);
btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
free:
- kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+ call_rcu(&inode->i_rcu, btrfs_i_callback);
}

int btrfs_drop_inode(struct inode *inode)
Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ceph/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -369,6 +369,15 @@ struct inode *ceph_alloc_inode(struct su
return &ci->vfs_inode;
}

+static void ceph_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ceph_inode_cachep, ci);
+}
+
void ceph_destroy_inode(struct inode *inode)
{
struct ceph_inode_info *ci = ceph_inode(inode);
@@ -408,7 +417,7 @@ void ceph_destroy_inode(struct inode *in
if (ci->i_xattrs.prealloc_blob)
ceph_buffer_put(ci->i_xattrs.prealloc_blob);

- kmem_cache_free(ceph_inode_cachep, ci);
+ call_rcu(&inode->i_rcu, ceph_i_callback);
}


Index: linux-2.6/fs/cifs/cifsfs.c
===================================================================
--- linux-2.6.orig/fs/cifs/cifsfs.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/cifs/cifsfs.c 2010-11-09 22:11:10.000000000 +1100
@@ -334,10 +334,17 @@ cifs_alloc_inode(struct super_block *sb)
return &cifs_inode->vfs_inode;
}

+static void cifs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+}
+
static void
cifs_destroy_inode(struct inode *inode)
{
- kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+ call_rcu(&inode->i_rcu, cifs_i_callback);
}

static void
Index: linux-2.6/fs/coda/inode.c
===================================================================
--- linux-2.6.orig/fs/coda/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/coda/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -56,11 +56,18 @@ static struct inode *coda_alloc_inode(st
return &ei->vfs_inode;
}

-static void coda_destroy_inode(struct inode *inode)
+static void coda_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(coda_inode_cachep, ITOC(inode));
}

+static void coda_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, coda_i_callback);
+}
+
static void init_once(void *foo)
{
struct coda_inode_info *ei = (struct coda_inode_info *) foo;
Index: linux-2.6/fs/ecryptfs/super.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ecryptfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -63,6 +63,16 @@ static struct inode *ecryptfs_alloc_inod
return inode;
}

+static void ecryptfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct ecryptfs_inode_info *inode_info;
+ inode_info = ecryptfs_inode_to_private(inode);
+
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+}
+
/**
* ecryptfs_destroy_inode
* @inode: The ecryptfs inode
@@ -89,7 +99,7 @@ static void ecryptfs_destroy_inode(struc
}
}
ecryptfs_destroy_crypt_stat(&inode_info->crypt_stat);
- kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+ call_rcu(&inode->i_rcu, ecryptfs_i_callback);
}

/**
Index: linux-2.6/fs/efs/super.c
===================================================================
--- linux-2.6.orig/fs/efs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/efs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -65,11 +65,18 @@ static struct inode *efs_alloc_inode(str
return &ei->vfs_inode;
}

-static void efs_destroy_inode(struct inode *inode)
+static void efs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(efs_inode_cachep, INODE_INFO(inode));
}

+static void efs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, efs_i_callback);
+}
+
static void init_once(void *foo)
{
struct efs_inode_info *ei = (struct efs_inode_info *) foo;
Index: linux-2.6/fs/exofs/super.c
===================================================================
--- linux-2.6.orig/fs/exofs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/exofs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -150,12 +150,19 @@ static struct inode *exofs_alloc_inode(s
return &oi->vfs_inode;
}

+static void exofs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+}
+
/*
* Remove an inode from the cache
*/
static void exofs_destroy_inode(struct inode *inode)
{
- kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+ call_rcu(&inode->i_rcu, exofs_i_callback);
}

/*
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ext4/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -841,6 +841,13 @@ static int ext4_drop_inode(struct inode
return drop;
}

+static void ext4_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+}
+
static void ext4_destroy_inode(struct inode *inode)
{
ext4_ioend_wait(inode);
@@ -853,7 +860,7 @@ static void ext4_destroy_inode(struct in
true);
dump_stack();
}
- kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+ call_rcu(&inode->i_rcu, ext4_i_callback);
}

static void init_once(void *foo)
Index: linux-2.6/fs/freevxfs/vxfs_inode.c
===================================================================
--- linux-2.6.orig/fs/freevxfs/vxfs_inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/freevxfs/vxfs_inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -337,6 +337,13 @@ vxfs_iget(struct super_block *sbp, ino_t
return ip;
}

+static void vxfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(vxfs_inode_cachep, inode->i_private);
+}
+
/**
* vxfs_evict_inode - remove inode from main memory
* @ip: inode to discard.
@@ -350,5 +357,5 @@ vxfs_evict_inode(struct inode *ip)
{
truncate_inode_pages(&ip->i_data, 0);
end_writeback(ip);
- kmem_cache_free(vxfs_inode_cachep, ip->i_private);
+ call_rcu(&ip->i_rcu, vxfs_i_callback);
}
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/fuse/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -99,6 +99,13 @@ static struct inode *fuse_alloc_inode(st
return inode;
}

+static void fuse_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(fuse_inode_cachep, inode);
+}
+
static void fuse_destroy_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -106,7 +113,7 @@ static void fuse_destroy_inode(struct in
BUG_ON(!list_empty(&fi->queued_writes));
if (fi->forget_req)
fuse_request_free(fi->forget_req);
- kmem_cache_free(fuse_inode_cachep, inode);
+ call_rcu(&inode->i_rcu, fuse_i_callback);
}

void fuse_send_forget(struct fuse_conn *fc, struct fuse_req *req,
Index: linux-2.6/fs/gfs2/super.c
===================================================================
--- linux-2.6.orig/fs/gfs2/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/gfs2/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -1405,11 +1405,18 @@ static struct inode *gfs2_alloc_inode(st
return &ip->i_inode;
}

-static void gfs2_destroy_inode(struct inode *inode)
+static void gfs2_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(gfs2_inode_cachep, inode);
}

+static void gfs2_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, gfs2_i_callback);
+}
+
const struct super_operations gfs2_super_ops = {
.alloc_inode = gfs2_alloc_inode,
.destroy_inode = gfs2_destroy_inode,
Index: linux-2.6/fs/hfs/super.c
===================================================================
--- linux-2.6.orig/fs/hfs/super.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/hfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -167,11 +167,18 @@ static struct inode *hfs_alloc_inode(str
return i ? &i->vfs_inode : NULL;
}

-static void hfs_destroy_inode(struct inode *inode)
+static void hfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(hfs_inode_cachep, HFS_I(inode));
}

+static void hfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, hfs_i_callback);
+}
+
static const struct super_operations hfs_super_operations = {
.alloc_inode = hfs_alloc_inode,
.destroy_inode = hfs_destroy_inode,
Index: linux-2.6/fs/hfsplus/super.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/hfsplus/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -488,11 +488,19 @@ static struct inode *hfsplus_alloc_inode
return i ? &i->vfs_inode : NULL;
}

-static void hfsplus_destroy_inode(struct inode *inode)
+static void hfsplus_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(hfsplus_inode_cachep, HFSPLUS_I(inode));
}

+static void hfsplus_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, hfsplus_i_callback);
+}
+
#define HFSPLUS_INODE_SIZE sizeof(struct hfsplus_inode_info)

static struct dentry *hfsplus_mount(struct file_system_type *fs_type,
Index: linux-2.6/fs/hpfs/super.c
===================================================================
--- linux-2.6.orig/fs/hpfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/hpfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -177,11 +177,18 @@ static struct inode *hpfs_alloc_inode(st
return &ei->vfs_inode;
}

-static void hpfs_destroy_inode(struct inode *inode)
+static void hpfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(hpfs_inode_cachep, hpfs_i(inode));
}

+static void hpfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, hpfs_i_callback);
+}
+
static void init_once(void *foo)
{
struct hpfs_inode_info *ei = (struct hpfs_inode_info *) foo;
Index: linux-2.6/fs/isofs/inode.c
===================================================================
--- linux-2.6.orig/fs/isofs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/isofs/inode.c 2010-11-09 23:22:51.000000000 +1100
@@ -65,11 +65,18 @@ static struct inode *isofs_alloc_inode(s
return &ei->vfs_inode;
}

-static void isofs_destroy_inode(struct inode *inode)
+static void isofs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(isofs_inode_cachep, ISOFS_I(inode));
}

+static void isofs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, isofs_i_callback);
+}
+
static void init_once(void *foo)
{
struct iso_inode_info *ei = foo;
Index: linux-2.6/fs/jffs2/super.c
===================================================================
--- linux-2.6.orig/fs/jffs2/super.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/jffs2/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -40,11 +40,18 @@ static struct inode *jffs2_alloc_inode(s
return &f->vfs_inode;
}

-static void jffs2_destroy_inode(struct inode *inode)
+static void jffs2_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(jffs2_inode_cachep, JFFS2_INODE_INFO(inode));
}

+static void jffs2_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, jffs2_i_callback);
+}
+
static void jffs2_i_init_once(void *foo)
{
struct jffs2_inode_info *f = foo;
Index: linux-2.6/fs/jfs/super.c
===================================================================
--- linux-2.6.orig/fs/jfs/super.c 2010-11-09 22:11:03.000000000 +1100
+++ linux-2.6/fs/jfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -115,6 +115,14 @@ static struct inode *jfs_alloc_inode(str
return &jfs_inode->vfs_inode;
}

+static void jfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct jfs_inode_info *ji = JFS_IP(inode);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(jfs_inode_cachep, ji);
+}
+
static void jfs_destroy_inode(struct inode *inode)
{
struct jfs_inode_info *ji = JFS_IP(inode);
@@ -128,7 +136,7 @@ static void jfs_destroy_inode(struct ino
ji->active_ag = -1;
}
spin_unlock_irq(&ji->ag_lock);
- kmem_cache_free(jfs_inode_cachep, ji);
+ call_rcu(&inode->i_rcu, jfs_i_callback);
}

static int jfs_statfs(struct dentry *dentry, struct kstatfs *buf)
Index: linux-2.6/fs/logfs/inode.c
===================================================================
--- linux-2.6.orig/fs/logfs/inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/logfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -141,13 +141,20 @@ struct inode *logfs_safe_iget(struct sup
return __logfs_iget(sb, ino);
}

+static void logfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(logfs_inode_cache, logfs_inode(inode));
+}
+
static void __logfs_destroy_inode(struct inode *inode)
{
struct logfs_inode *li = logfs_inode(inode);

BUG_ON(li->li_block);
list_del(&li->li_freeing_list);
- kmem_cache_free(logfs_inode_cache, li);
+ call_rcu(&inode->i_rcu, logfs_i_callback);
}

static void logfs_destroy_inode(struct inode *inode)
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/minix/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -68,11 +68,18 @@ static struct inode *minix_alloc_inode(s
return &ei->vfs_inode;
}

-static void minix_destroy_inode(struct inode *inode)
+static void minix_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(minix_inode_cachep, minix_i(inode));
}

+static void minix_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, minix_i_callback);
+}
+
static void init_once(void *foo)
{
struct minix_inode_info *ei = (struct minix_inode_info *) foo;
Index: linux-2.6/fs/ncpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ncpfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -59,11 +59,18 @@ static struct inode *ncp_alloc_inode(str
return &ei->vfs_inode;
}

-static void ncp_destroy_inode(struct inode *inode)
+static void ncp_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(ncp_inode_cachep, NCP_FINFO(inode));
}

+static void ncp_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, ncp_i_callback);
+}
+
static void init_once(void *foo)
{
struct ncp_inode_info *ei = (struct ncp_inode_info *) foo;
Index: linux-2.6/fs/nilfs2/super.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/super.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/nilfs2/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -162,10 +162,13 @@ struct inode *nilfs_alloc_inode(struct s
return &ii->vfs_inode;
}

-void nilfs_destroy_inode(struct inode *inode)
+static void nilfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
struct nilfs_mdt_info *mdi = NILFS_MDT(inode);

+ INIT_LIST_HEAD(&inode->i_dentry);
+
if (mdi) {
kfree(mdi->mi_bgl); /* kfree(NULL) is safe */
kfree(mdi);
@@ -173,6 +176,11 @@ void nilfs_destroy_inode(struct inode *i
kmem_cache_free(nilfs_inode_cachep, NILFS_I(inode));
}

+void nilfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, nilfs_i_callback);
+}
+
static int nilfs_sync_super(struct nilfs_sb_info *sbi, int flag)
{
struct the_nilfs *nilfs = sbi->s_nilfs;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/fs/ntfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -332,6 +332,13 @@ struct inode *ntfs_alloc_big_inode(struc
return NULL;
}

+static void ntfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+}
+
void ntfs_destroy_big_inode(struct inode *inode)
{
ntfs_inode *ni = NTFS_I(inode);
@@ -340,7 +347,7 @@ void ntfs_destroy_big_inode(struct inode
BUG_ON(ni->page);
if (!atomic_dec_and_test(&ni->count))
BUG();
- kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+ call_rcu(&inode->i_rcu, ntfs_i_callback);
}

static inline ntfs_inode *ntfs_alloc_extent_inode(void)
Index: linux-2.6/fs/ocfs2/dlmfs/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmfs/dlmfs.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ocfs2/dlmfs/dlmfs.c 2010-11-09 22:11:10.000000000 +1100
@@ -351,11 +351,18 @@ static struct inode *dlmfs_alloc_inode(s
return &ip->ip_vfs_inode;
}

-static void dlmfs_destroy_inode(struct inode *inode)
+static void dlmfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(dlmfs_inode_cache, DLMFS_I(inode));
}

+static void dlmfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, dlmfs_i_callback);
+}
+
static void dlmfs_evict_inode(struct inode *inode)
{
int status;
Index: linux-2.6/fs/ocfs2/super.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ocfs2/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -570,11 +570,18 @@ static struct inode *ocfs2_alloc_inode(s
return &oi->vfs_inode;
}

-static void ocfs2_destroy_inode(struct inode *inode)
+static void ocfs2_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(ocfs2_inode_cachep, OCFS2_I(inode));
}

+static void ocfs2_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, ocfs2_i_callback);
+}
+
static unsigned long long ocfs2_max_file_offset(unsigned int bbits,
unsigned int cbits)
{
Index: linux-2.6/fs/openpromfs/inode.c
===================================================================
--- linux-2.6.orig/fs/openpromfs/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/openpromfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -343,11 +343,18 @@ static struct inode *openprom_alloc_inod
return &oi->vfs_inode;
}

-static void openprom_destroy_inode(struct inode *inode)
+static void openprom_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(op_inode_cachep, OP_I(inode));
}

+static void openprom_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, openprom_i_callback);
+}
+
static struct inode *openprom_iget(struct super_block *sb, ino_t ino)
{
struct inode *inode;
Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/qnx4/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -425,11 +425,18 @@ static struct inode *qnx4_alloc_inode(st
return &ei->vfs_inode;
}

-static void qnx4_destroy_inode(struct inode *inode)
+static void qnx4_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(qnx4_inode_cachep, qnx4_i(inode));
}

+static void qnx4_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, qnx4_i_callback);
+}
+
static void init_once(void *foo)
{
struct qnx4_inode_info *ei = (struct qnx4_inode_info *) foo;
Index: linux-2.6/fs/reiserfs/super.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/reiserfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -530,11 +530,18 @@ static struct inode *reiserfs_alloc_inod
return &ei->vfs_inode;
}

-static void reiserfs_destroy_inode(struct inode *inode)
+static void reiserfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(reiserfs_inode_cachep, REISERFS_I(inode));
}

+static void reiserfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, reiserfs_i_callback);
+}
+
static void init_once(void *foo)
{
struct reiserfs_inode_info *ei = (struct reiserfs_inode_info *)foo;
Index: linux-2.6/fs/romfs/super.c
===================================================================
--- linux-2.6.orig/fs/romfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/romfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -400,11 +400,18 @@ static struct inode *romfs_alloc_inode(s
/*
* return a spent inode to the slab cache
*/
-static void romfs_destroy_inode(struct inode *inode)
+static void romfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(romfs_inode_cachep, ROMFS_I(inode));
}

+static void romfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, romfs_i_callback);
+}
+
/*
* get filesystem statistics
*/
Index: linux-2.6/fs/squashfs/super.c
===================================================================
--- linux-2.6.orig/fs/squashfs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/squashfs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -440,11 +440,18 @@ static struct inode *squashfs_alloc_inod
}


-static void squashfs_destroy_inode(struct inode *inode)
+static void squashfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(squashfs_inode_cachep, squashfs_i(inode));
}

+static void squashfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, squashfs_i_callback);
+}
+

static struct file_system_type squashfs_fs_type = {
.owner = THIS_MODULE,
Index: linux-2.6/fs/sysv/inode.c
===================================================================
--- linux-2.6.orig/fs/sysv/inode.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/sysv/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -333,11 +333,18 @@ static struct inode *sysv_alloc_inode(st
return &si->vfs_inode;
}

-static void sysv_destroy_inode(struct inode *inode)
+static void sysv_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(sysv_inode_cachep, SYSV_I(inode));
}

+static void sysv_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, sysv_i_callback);
+}
+
static void init_once(void *p)
{
struct sysv_inode_info *si = (struct sysv_inode_info *)p;
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ubifs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -272,12 +272,20 @@ static struct inode *ubifs_alloc_inode(s
return &ui->vfs_inode;
};

+static void ubifs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct ubifs_inode *ui = ubifs_inode(inode);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ubifs_inode_slab, ui);
+}
+
static void ubifs_destroy_inode(struct inode *inode)
{
struct ubifs_inode *ui = ubifs_inode(inode);

kfree(ui->data);
- kmem_cache_free(ubifs_inode_slab, inode);
+ call_rcu(&inode->i_rcu, ubifs_i_callback);
}

/*
Index: linux-2.6/fs/udf/super.c
===================================================================
--- linux-2.6.orig/fs/udf/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/udf/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -139,11 +139,18 @@ static struct inode *udf_alloc_inode(str
return &ei->vfs_inode;
}

-static void udf_destroy_inode(struct inode *inode)
+static void udf_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(udf_inode_cachep, UDF_I(inode));
}

+static void udf_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, udf_i_callback);
+}
+
static void init_once(void *foo)
{
struct udf_inode_info *ei = (struct udf_inode_info *)foo;
Index: linux-2.6/fs/ufs/super.c
===================================================================
--- linux-2.6.orig/fs/ufs/super.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/ufs/super.c 2010-11-09 22:11:10.000000000 +1100
@@ -1412,11 +1412,18 @@ static struct inode *ufs_alloc_inode(str
return &ei->vfs_inode;
}

-static void ufs_destroy_inode(struct inode *inode)
+static void ufs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(ufs_inode_cachep, UFS_I(inode));
}

+static void ufs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, ufs_i_callback);
+}
+
static void init_once(void *foo)
{
struct ufs_inode_info *ei = (struct ufs_inode_info *) foo;
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -71,12 +71,18 @@ spufs_alloc_inode(struct super_block *sb
return &ei->vfs_inode;
}

-static void
-spufs_destroy_inode(struct inode *inode)
+static void spufs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(spufs_inode_cache, SPUFS_I(inode));
}

+static void spufs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, spufs_i_callback);
+}
+
static void
spufs_init_once(void *p)
{
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -826,6 +826,14 @@ const struct address_space_operations po
.set_page_dirty = __set_page_dirty_nobuffers,
};

+static void pohmelfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ atomic_long_dec(&POHMELFS_SB(inode->i_sb)->total_inodes);
+ kmem_cache_free(pohmelfs_inode_cache, POHMELFS_I(inode));
+}
+
/*
* ->detroy_inode() callback. Deletes inode from the caches
* and frees private data.
@@ -842,8 +850,7 @@ static void pohmelfs_destroy_inode(struc

dprintk("%s: pi: %p, inode: %p, ino: %llu.\n",
__func__, pi, &pi->vfs_inode, pi->ino);
- kmem_cache_free(pohmelfs_inode_cache, pi);
- atomic_long_dec(&psb->total_inodes);
+ call_rcu(&inode->i_rcu, pohmelfs_i_callback);
}

/*
Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/hostfs/hostfs_kern.c 2010-11-09 23:22:53.000000000 +1100
@@ -251,11 +251,18 @@ static void hostfs_evict_inode(struct in
}
}

-static void hostfs_destroy_inode(struct inode *inode)
+static void hostfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kfree(HOSTFS_I(inode));
}

+static void hostfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, hostfs_i_callback);
+}
+
static int hostfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
{
const char *root_path = vfs->mnt_sb->s_fs_info;
Index: linux-2.6/fs/hppfs/hppfs.c
===================================================================
--- linux-2.6.orig/fs/hppfs/hppfs.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/hppfs/hppfs.c 2010-11-09 22:11:10.000000000 +1100
@@ -632,11 +632,18 @@ void hppfs_evict_inode(struct inode *ino
mntput(ino->i_sb->s_fs_info);
}

-static void hppfs_destroy_inode(struct inode *inode)
+static void hppfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kfree(HPPFS_I(inode));
}

+static void hppfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, hppfs_i_callback);
+}
+
static const struct super_operations hppfs_sbops = {
.alloc_inode = hppfs_alloc_inode,
.destroy_inode = hppfs_destroy_inode,
Index: linux-2.6/fs/xfs/xfs_iget.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_iget.c 2010-11-09 22:11:02.000000000 +1100
+++ linux-2.6/fs/xfs/xfs_iget.c 2010-11-09 22:11:10.000000000 +1100
@@ -91,6 +91,17 @@ xfs_inode_alloc(
return ip;
}

+STATIC void
+xfs_inode_free_callback(
+ struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct xfs_inode *ip = XFS_I(inode);
+
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_zone_free(xfs_inode_zone, ip);
+}
+
void
xfs_inode_free(
struct xfs_inode *ip)
@@ -134,7 +145,7 @@ xfs_inode_free(
ASSERT(!spin_is_locked(&ip->i_flags_lock));
ASSERT(completion_done(&ip->i_flush));

- kmem_zone_free(xfs_inode_zone, ip);
+ call_rcu(&ip->i_vnode.i_rcu, xfs_inode_free_callback);
}

/*
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting 2010-11-09 22:11:01.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting 2010-11-09 23:22:52.000000000 +1100
@@ -318,3 +318,15 @@ if it's zero is not *and* *never* *had*
may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
free the on-disk inode, you may end up doing that while ->write_inode() is writing
to it.
+
+--
+[mandatory]
+
+ Filesystems must RCU-free their inodes, if they can have been accessed via
+rcu-walk path walk (basically, if the file can have had a path name in the vfs
+namespace).
+
+ i_dentry and i_rcu share storage in a union, and the vfs expects i_dentry
+to be reinitialized before it is freed, so INIT_LIST_HEAD(&inode->i_dentry)
+must be done in the RCU callback.
+


2010-11-09 12:48:01

by Nick Piggin

Subject: [patch 2/6] fs: icache avoid RCU freeing for pseudo fs

Filesystems whose inodes are not put on any RCU list and are not reachable
via rcu-walk dentries do not need to RCU-free their inodes.
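
For instance (hypothetical filesystem, not part of this patch), such a
filesystem opts out simply by pointing ->destroy_inode at the new helper,
exactly as pipefs does below:

static const struct super_operations examplefs_ops = {
	.statfs		= simple_statfs,
	.destroy_inode	= free_inode_nonrcu,	/* never seen by rcu-walk */
};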

Signed-off-by: Nick Piggin <[email protected]>

---
fs/inode.c | 6 ++++++
fs/pipe.c | 6 +++++-
include/linux/fs.h | 1 +
include/linux/net.h | 1 +
net/socket.c | 17 +++++++++--------
5 files changed, 22 insertions(+), 9 deletions(-)

Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/net/socket.c 2010-11-09 22:11:10.000000000 +1100
@@ -262,20 +262,21 @@ static struct inode *sock_alloc_inode(st
}


-static void sock_free_rcu(struct rcu_head *head)
+
+static void wq_free_rcu(struct rcu_head *head)
{
- struct inode *inode = container_of(head, struct inode, i_rcu);
- struct socket_alloc *ei = container_of(inode, struct socket_alloc,
- vfs_inode);
+ struct socket_wq *wq = container_of(head, struct socket_wq, rcu);

- kfree(ei->socket.wq);
- INIT_LIST_HEAD(&inode->i_dentry);
- kmem_cache_free(sock_inode_cachep, ei);
+ kfree(wq);
}

static void sock_destroy_inode(struct inode *inode)
{
- call_rcu(&inode->i_rcu, sock_free_rcu);
+ struct socket_alloc *ei;
+
+ ei = container_of(inode, struct socket_alloc, vfs_inode);
+ call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
+ kmem_cache_free(sock_inode_cachep, ei);
}

static void init_once(void *foo)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -255,6 +255,12 @@ static struct inode *alloc_inode(struct
return inode;
}

+void free_inode_nonrcu(struct inode *inode)
+{
+ kmem_cache_free(inode_cachep, inode);
+}
+EXPORT_SYMBOL(free_inode_nonrcu);
+
void __destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c 2010-11-09 22:11:00.000000000 +1100
+++ linux-2.6/fs/pipe.c 2010-11-09 22:11:10.000000000 +1100
@@ -1241,6 +1241,10 @@ long pipe_fcntl(struct file *file, unsig
return ret;
}

+static const struct super_operations pipefs_ops = {
+ .destroy_inode = free_inode_nonrcu,
+};
+
/*
* pipefs should _never_ be mounted by userland - too much of security hassle,
* no real gain from having the whole whorehouse mounted. So we don't need
@@ -1250,7 +1254,7 @@ long pipe_fcntl(struct file *file, unsig
static struct dentry *pipefs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
- return mount_pseudo(fs_type, "pipe:", NULL, PIPEFS_MAGIC);
+ return mount_pseudo(fs_type, "pipe:", &pipefs_ops, PIPEFS_MAGIC);
}

static struct file_system_type pipe_fs_type = {
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/include/linux/fs.h 2010-11-09 22:11:10.000000000 +1100
@@ -2233,6 +2233,7 @@ extern void iget_failed(struct inode *);
extern void end_writeback(struct inode *);
extern void __destroy_inode(struct inode *);
extern struct inode *new_inode(struct super_block *);
+extern void free_inode_nonrcu(struct inode *inode);
extern int should_remove_suid(struct dentry *);
extern int file_remove_suid(struct file *);

Index: linux-2.6/include/linux/net.h
===================================================================
--- linux-2.6.orig/include/linux/net.h 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/include/linux/net.h 2010-11-09 22:11:10.000000000 +1100
@@ -120,6 +120,7 @@ enum sock_shutdown_cmd {
struct socket_wq {
wait_queue_head_t wait;
struct fasync_struct *fasync_list;
+ struct rcu_head rcu;
} ____cacheline_aligned_in_smp;

/**

2010-11-09 12:59:00

by Nick Piggin

Subject: [patch 3/6] fs: dcache documentation cleanup

I would like to get the following 3 patches merged soon, because they
touch a lot of filesystems, but only trivially. (This clashes trivially
with the RCU patches if taken out of order, but that is easily fixable.)

Where are we going?

http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git;a=blob;f=Documentation/filesystems/path-lookup.txt;h=5435c136d0307854761e9aaaa9886e282d67a881;hb=refs/heads/vfs-scale

So please review that if you still have doubts about the overall design.

I have been toying with pulling rcu-walk first, before dcache scaling.
I'm not really happy with it, because the property that d_lock protects
d_count, d_seq, and all other elements of the dentry really helps make
it easy to see that concurrency isn't a problem when doing tricky things
like converting rcu-and-seqlock based walking into lock-and-refcount
based walking midway through a path lookup.

So while it is _possible_ to do these things with dcache_lock and
careful use of atomic_inc_not_zero etc., that really sucks and goes in
totally the wrong direction (dropping out of rcu-walk is not a
completely rare event, so we could end up with more dcache_lock locking
in path walks than we have now).

--

Remove redundant (and, since the dcache RCU lookup changes, actually
incorrect) dentry locking documentation and point to the canonical
document.

Signed-off-by: Nick Piggin <[email protected]>

---
include/linux/dcache.h | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h 2010-11-09 22:11:00.000000000 +1100
+++ linux-2.6/include/linux/dcache.h 2010-11-09 23:22:52.000000000 +1100
@@ -141,22 +141,16 @@ struct dentry_operations {
char *(*d_dname)(struct dentry *, char *, int);
};

-/* the dentry parameter passed to d_hash and d_compare is the parent
+/*
+ * Locking rules for dentry_operations callbacks are to be found in
+ * Documentation/filesystems/Locking. Keep it updated!
+ *
+ * the dentry parameter passed to d_hash and d_compare is the parent
* directory of the entries to be compared. It is used in case these
* functions need any directory specific information for determining
* equivalency classes. Using the dentry itself might not work, as it
* might be a negative dentry which has no information associated with
- * it */
-
-/*
-locking rules:
- big lock dcache_lock d_lock may block
-d_revalidate: no no no yes
-d_hash no no no yes
-d_compare: no yes yes no
-d_delete: no yes no no
-d_release: no no no yes
-d_iput: no no no yes
+ * it.
*/

/* d_flags entries */

2010-11-09 13:01:42

by Nick Piggin

[permalink] [raw]
Subject: [patch 4/6] fs: d_delete change

Actually this change is required for basic dcache scaling rather than
rcu-walk, but I think it's a good one anyway. Should be easy to merge.

--

Change d_delete from a dentry deletion notification to a dentry caching
advice, more like ->drop_inode. Require it to be constant and idempotent,
and not to take d_lock. This is how all existing filesystems use the
callback anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
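
As an illustrative sketch (not part of this patch), a ->d_delete callback
under the new convention might look like the following; "examplefs" is a
made-up filesystem, and the policy shown (drop negative dentries, cache
positive ones) is only an example:

static int examplefs_d_delete(const struct dentry *dentry)
{
        /*
         * Constant and idempotent: no locks taken, nothing stored,
         * just advise the dcache whether to keep this unreferenced
         * dentry around.
         */
        if (!dentry->d_inode)
                return 1;       /* negative dentry: delete immediately */
        return 0;               /* positive dentry: keep it cached */
}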

Signed-off-by: Nick Piggin <[email protected]>

---
Documentation/filesystems/porting | 7 +++++++
Documentation/filesystems/vfs.txt | 27 +++++++++++++--------------
arch/ia64/kernel/perfmon.c | 2 +-
fs/9p/vfs_dentry.c | 4 ++--
fs/afs/dir.c | 4 ++--
fs/btrfs/inode.c | 2 +-
fs/coda/dir.c | 4 ++--
fs/configfs/dir.c | 2 +-
fs/dcache.c | 2 --
fs/gfs2/dentry.c | 2 +-
fs/hostfs/hostfs_kern.c | 2 +-
fs/libfs.c | 2 +-
fs/ncpfs/dir.c | 4 ++--
fs/nfs/dir.c | 2 +-
fs/proc/base.c | 2 +-
fs/proc/generic.c | 2 +-
fs/proc/proc_sysctl.c | 2 +-
fs/sysfs/dir.c | 2 +-
include/linux/dcache.h | 6 +++---
net/sunrpc/rpc_pipe.c | 2 +-
20 files changed, 43 insertions(+), 39 deletions(-)

Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting 2010-11-09 23:22:51.000000000 +1100
@@ -330,3 +330,10 @@ namespace).
to be reinitialized before it is freed, so INIT_LIST_HEAD(&inode->i_dentry)
must be done in the RCU callback.

+---
+[mandatory]
+
+ .d_delete() now only advises the dcache as to whether or not to cache
+unreferenced dentries, and is now only called when the dentry refcount goes to
+0. Even on 0 refcount transition, it must be able to tolerate being called 0, 1,
+or more times (eg. constant, idempotent).
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt 2010-11-09 23:22:51.000000000 +1100
@@ -841,9 +841,9 @@ the VFS uses a default. As of kernel 2.6

struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
- int (*d_hash) (struct dentry *, struct qstr *);
- int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
- int (*d_delete)(struct dentry *);
+ int (*d_hash)(struct dentry *, struct qstr *);
+ int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+ int (*d_delete)(const struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *, char *, int);
@@ -858,9 +858,11 @@ struct dentry_operations {

d_compare: called when a dentry should be compared with another

- d_delete: called when the last reference to a dentry is
- deleted. This means no-one is using the dentry, however it is
- still valid and in the dcache
+ d_delete: called when the last reference to a dentry is dropped and the
+ dcache is deciding whether or not to cache it. Return 1 to delete
+ immediately, or 0 to cache the dentry. Default is NULL which means to
+ always cache a reachable dentry. d_delete must be constant and
+ idempotent.

d_release: called when a dentry is really deallocated

@@ -904,14 +906,11 @@ There are a number of functions defined
the usage count)

dput: close a handle for a dentry (decrements the usage count). If
- the usage count drops to 0, the "d_delete" method is called
- and the dentry is placed on the unused list if the dentry is
- still in its parents hash list. Putting the dentry on the
- unused list just means that if the system needs some RAM, it
- goes through the unused list of dentries and deallocates them.
- If the dentry has already been unhashed and the usage count
- drops to 0, in this case the dentry is deallocated after the
- "d_delete" method is called
+ the usage count drops to 0, and the dentry is still in its
+ parent's hash, the "d_delete" method is called to check whether
+ it should be cached. If it should not be cached, or if the dentry
+ is not hashed, it is deleted. Otherwise cached dentries are put
+ into an LRU list to be reclaimed on memory shortage.

d_drop: this unhashes a dentry from its parents hash list. A
subsequent call to dput() will deallocate the dentry if its
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/dcache.c 2010-11-09 23:22:51.000000000 +1100
@@ -446,8 +446,6 @@ static void prune_one_dentry(struct dent
if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
return;

- if (dentry->d_op && dentry->d_op->d_delete)
- dentry->d_op->d_delete(dentry);
dentry_lru_del(dentry);
__d_drop(dentry);
dentry = d_kill(dentry);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/include/linux/dcache.h 2010-11-09 23:22:51.000000000 +1100
@@ -133,9 +133,9 @@ enum dentry_d_lock_class

struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
- int (*d_hash) (struct dentry *, struct qstr *);
- int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
- int (*d_delete)(struct dentry *);
+ int (*d_hash)(struct dentry *, struct qstr *);
+ int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+ int (*d_delete)(const struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *, char *, int);
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/arch/ia64/kernel/perfmon.c 2010-11-09 22:11:10.000000000 +1100
@@ -2185,7 +2185,7 @@ static const struct file_operations pfm_
};

static int
-pfmfs_delete_dentry(struct dentry *dentry)
+pfmfs_delete_dentry(const struct dentry *dentry)
{
return 1;
}
Index: linux-2.6/fs/9p/vfs_dentry.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_dentry.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/9p/vfs_dentry.c 2010-11-09 22:11:10.000000000 +1100
@@ -51,7 +51,7 @@
*
*/

-static int v9fs_dentry_delete(struct dentry *dentry)
+static int v9fs_dentry_delete(const struct dentry *dentry)
{
P9_DPRINTK(P9_DEBUG_VFS, " dentry: %s (%p)\n", dentry->d_name.name,
dentry);
@@ -68,7 +68,7 @@ static int v9fs_dentry_delete(struct den
*
*/

-static int v9fs_cached_dentry_delete(struct dentry *dentry)
+static int v9fs_cached_dentry_delete(const struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
P9_DPRINTK(P9_DEBUG_VFS, " dentry: %s (%p)\n", dentry->d_name.name,
Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c 2010-11-09 22:11:00.000000000 +1100
+++ linux-2.6/fs/afs/dir.c 2010-11-09 22:11:10.000000000 +1100
@@ -23,7 +23,7 @@ static struct dentry *afs_lookup(struct
static int afs_dir_open(struct inode *inode, struct file *file);
static int afs_readdir(struct file *file, void *dirent, filldir_t filldir);
static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd);
-static int afs_d_delete(struct dentry *dentry);
+static int afs_d_delete(const struct dentry *dentry);
static void afs_d_release(struct dentry *dentry);
static int afs_lookup_filldir(void *_cookie, const char *name, int nlen,
loff_t fpos, u64 ino, unsigned dtype);
@@ -730,7 +730,7 @@ static int afs_d_revalidate(struct dentr
* - called from dput() when d_count is going to 0.
* - return 1 to request dentry be unhashed, 0 otherwise
*/
-static int afs_d_delete(struct dentry *dentry)
+static int afs_d_delete(const struct dentry *dentry)
{
_enter("%s", dentry->d_name.name);

Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c 2010-11-09 22:11:10.000000000 +1100
@@ -4127,7 +4127,7 @@ struct inode *btrfs_lookup_dentry(struct
return inode;
}

-static int btrfs_dentry_delete(struct dentry *dentry)
+static int btrfs_dentry_delete(const struct dentry *dentry)
{
struct btrfs_root *root;

Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/coda/dir.c 2010-11-09 22:11:10.000000000 +1100
@@ -47,7 +47,7 @@ static int coda_readdir(struct file *fil

/* dentry ops */
static int coda_dentry_revalidate(struct dentry *de, struct nameidata *nd);
-static int coda_dentry_delete(struct dentry *);
+static int coda_dentry_delete(const struct dentry *);

/* support routines */
static int coda_venus_readdir(struct file *coda_file, void *buf,
@@ -577,7 +577,7 @@ static int coda_dentry_revalidate(struct
* This is the callback from dput() when d_count is going to 0.
* We use this to unhash dentries with bad inodes.
*/
-static int coda_dentry_delete(struct dentry * dentry)
+static int coda_dentry_delete(const struct dentry * dentry)
{
int flags;

Index: linux-2.6/fs/configfs/dir.c
===================================================================
--- linux-2.6.orig/fs/configfs/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/configfs/dir.c 2010-11-09 22:11:10.000000000 +1100
@@ -67,7 +67,7 @@ static void configfs_d_iput(struct dentr
* We _must_ delete our dentries on last dput, as the chain-to-parent
* behavior is required to clear the parents of default_groups.
*/
-static int configfs_d_delete(struct dentry *dentry)
+static int configfs_d_delete(const struct dentry *dentry)
{
return 1;
}
Index: linux-2.6/fs/gfs2/dentry.c
===================================================================
--- linux-2.6.orig/fs/gfs2/dentry.c 2010-11-09 22:11:00.000000000 +1100
+++ linux-2.6/fs/gfs2/dentry.c 2010-11-09 23:22:50.000000000 +1100
@@ -106,7 +106,7 @@ static int gfs2_dhash(struct dentry *den
return 0;
}

-static int gfs2_dentry_delete(struct dentry *dentry)
+static int gfs2_dentry_delete(const struct dentry *dentry)
{
struct gfs2_inode *ginode;

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/hostfs/hostfs_kern.c 2010-11-09 22:11:10.000000000 +1100
@@ -32,7 +32,7 @@ static inline struct hostfs_inode_info *

#define FILE_HOSTFS_I(file) HOSTFS_I((file)->f_path.dentry->d_inode)

-static int hostfs_d_delete(struct dentry *dentry)
+static int hostfs_d_delete(const struct dentry *dentry)
{
return 1;
}
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/libfs.c 2010-11-09 22:11:10.000000000 +1100
@@ -37,7 +37,7 @@ int simple_statfs(struct dentry *dentry,
* Retaining negative dentries for an in-memory filesystem just wastes
* memory and lookup time: arrange for them to be deleted immediately.
*/
-static int simple_delete_dentry(struct dentry *dentry)
+static int simple_delete_dentry(const struct dentry *dentry)
{
return 1;
}
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/ncpfs/dir.c 2010-11-09 23:22:51.000000000 +1100
@@ -77,7 +77,7 @@ const struct inode_operations ncp_dir_in
static int ncp_lookup_validate(struct dentry *, struct nameidata *);
static int ncp_hash_dentry(struct dentry *, struct qstr *);
static int ncp_compare_dentry (struct dentry *, struct qstr *, struct qstr *);
-static int ncp_delete_dentry(struct dentry *);
+static int ncp_delete_dentry(const struct dentry *);

static const struct dentry_operations ncp_dentry_operations =
{
@@ -163,7 +163,7 @@ ncp_compare_dentry(struct dentry *dentry
* Closing files can be safely postponed until iput() - it's done there anyway.
*/
static int
-ncp_delete_dentry(struct dentry * dentry)
+ncp_delete_dentry(const struct dentry * dentry)
{
struct inode *inode = dentry->d_inode;

Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/nfs/dir.c 2010-11-09 22:11:10.000000000 +1100
@@ -1091,7 +1091,7 @@ static int nfs_lookup_revalidate(struct
/*
* This is called from dput() when d_count is going to 0.
*/
-static int nfs_dentry_delete(struct dentry *dentry)
+static int nfs_dentry_delete(const struct dentry *dentry)
{
dfprintk(VFS, "NFS: dentry_delete(%s/%s, %x)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/proc/base.c 2010-11-09 22:11:10.000000000 +1100
@@ -1744,7 +1744,7 @@ static int pid_revalidate(struct dentry
return 0;
}

-static int pid_delete_dentry(struct dentry * dentry)
+static int pid_delete_dentry(const struct dentry * dentry)
{
/* Is the task we represent dead?
* If so, then don't put the dentry on the lru list,
Index: linux-2.6/fs/proc/generic.c
===================================================================
--- linux-2.6.orig/fs/proc/generic.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/proc/generic.c 2010-11-09 22:11:10.000000000 +1100
@@ -400,7 +400,7 @@ static const struct inode_operations pro
* smarter: we could keep a "volatile" flag in the
* inode to indicate which ones to keep.
*/
-static int proc_delete_dentry(struct dentry * dentry)
+static int proc_delete_dentry(const struct dentry * dentry)
{
return 1;
}
Index: linux-2.6/fs/proc/proc_sysctl.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_sysctl.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/proc/proc_sysctl.c 2010-11-09 23:22:51.000000000 +1100
@@ -392,7 +392,7 @@ static int proc_sys_revalidate(struct de
return !PROC_I(dentry->d_inode)->sysctl->unregistering;
}

-static int proc_sys_delete(struct dentry *dentry)
+static int proc_sys_delete(const struct dentry *dentry)
{
return !!PROC_I(dentry->d_inode)->sysctl->unregistering;
}
Index: linux-2.6/fs/sysfs/dir.c
===================================================================
--- linux-2.6.orig/fs/sysfs/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/sysfs/dir.c 2010-11-09 22:11:10.000000000 +1100
@@ -231,7 +231,7 @@ void release_sysfs_dirent(struct sysfs_d
goto repeat;
}

-static int sysfs_dentry_delete(struct dentry *dentry)
+static int sysfs_dentry_delete(const struct dentry *dentry)
{
struct sysfs_dirent *sd = dentry->d_fsdata;
return !!(sd->s_flags & SYSFS_FLAG_REMOVED);
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/net/sunrpc/rpc_pipe.c 2010-11-09 22:11:10.000000000 +1100
@@ -438,7 +438,7 @@ void rpc_put_mount(void)
}
EXPORT_SYMBOL_GPL(rpc_put_mount);

-static int rpc_delete_dentry(struct dentry *dentry)
+static int rpc_delete_dentry(const struct dentry *dentry)
{
return 1;
}

2010-11-09 13:02:44

by Nick Piggin

[permalink] [raw]
Subject: [patch 5/6] fs: d_compare change for rcu-walk

Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback;
however, there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

cifs and jfs contain the non-trivial changes: they no longer overwrite
the dentry name in their d_compare(), but rather use the d_revalidate()
method to ensure case preservation in the presence of negative dentries,
an approach taken from fatfs.
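
For illustration only (not part of this patch), a case-insensitive
d_compare under the new calling convention could look like the sketch
below; "examplefs" is a made-up filesystem and strnicmp() stands in for
whatever comparison the filesystem actually needs:

static int examplefs_ci_compare(const struct dentry *parent,
                const struct dentry *dentry, const struct inode *inode,
                unsigned int len, const char *str, const struct qstr *name)
{
        /*
         * Only reads the name data handed in, takes no locks and
         * stores nothing, so it is safe to call from rcu-walk.
         * Return 0 on match, 1 otherwise.
         */
        if (len != name->len)
                return 1;
        return strnicmp(str, name->name, len) ? 1 : 0;
}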

Signed-off-by: Nick Piggin <[email protected]>

---
Documentation/filesystems/Locking | 4 +
Documentation/filesystems/porting | 7 +++
Documentation/filesystems/vfs.txt | 25 +++++++++-
fs/adfs/dir.c | 8 ++-
fs/affs/namei.c | 44 +++++++++++--------
fs/cifs/dir.c | 52 ++++++++++++----------
fs/dcache.c | 4 +
fs/fat/namei_msdos.c | 15 +++---
fs/fat/namei_vfat.c | 39 +++++++++++-----
fs/hfs/hfs_fs.h | 4 +
fs/hfs/string.c | 14 +++---
fs/hfsplus/hfsplus_fs.h | 4 +
fs/hfsplus/unicode.c | 14 +++---
fs/hpfs/dentry.c | 21 +++++----
fs/isofs/inode.c | 88 ++++++++++++++++++--------------------
fs/isofs/namei.c | 3 -
fs/jfs/namei.c | 53 +++++++++++++++++-----
fs/ncpfs/dir.c | 29 ++++++++----
fs/ncpfs/ncplib_kernel.h | 8 +--
fs/proc/proc_sysctl.c | 12 ++---
include/linux/dcache.h | 12 ++---
include/linux/ncp_fs.h | 4 -
22 files changed, 286 insertions(+), 178 deletions(-)

Index: linux-2.6/fs/adfs/dir.c
===================================================================
--- linux-2.6.orig/fs/adfs/dir.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/adfs/dir.c 2010-11-09 23:22:50.000000000 +1100
@@ -237,17 +237,19 @@ adfs_hash(struct dentry *parent, struct
* requirements of the underlying filesystem.
*/
static int
-adfs_compare(struct dentry *parent, struct qstr *entry, struct qstr *name)
+adfs_compare(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
int i;

- if (entry->len != name->len)
+ if (len != name->len)
return 1;

for (i = 0; i < name->len; i++) {
char a, b;

- a = entry->name[i];
+ a = str[i];
b = name->name[i];

if (a >= 'A' && a <= 'Z')
Index: linux-2.6/fs/cifs/dir.c
===================================================================
--- linux-2.6.orig/fs/cifs/dir.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/cifs/dir.c 2010-11-09 23:22:49.000000000 +1100
@@ -656,22 +656,34 @@ cifs_lookup(struct inode *parent_dir_ino
static int
cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
{
- int isValid = 1;
-
if (direntry->d_inode) {
if (cifs_revalidate_dentry(direntry))
return 0;
- } else {
- cFYI(1, "neg dentry 0x%p name = %s",
- direntry, direntry->d_name.name);
- if (time_after(jiffies, direntry->d_time + HZ) ||
- !lookupCacheEnabled) {
- d_drop(direntry);
- isValid = 0;
- }
+ else
+ return 1;
+ }
+
+ /*
+ * This may be nfsd (or something), anyway, we can't see the
+ * intent of this. So, since this can be for creation, drop it.
+ */
+ if (!nd)
+ return 0;
+
+ /*
+ * Drop the negative dentry, in order to make sure to use the
+ * case sensitive name which is specified by user if this is
+ * for creation.
+ */
+ if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) {
+ if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
+ return 0;
}

- return isValid;
+ if (time_after(jiffies, direntry->d_time + HZ) || !lookupCacheEnabled)
+ return 0;
+
+ return 1;
}

/* static int cifs_d_delete(struct dentry *direntry)
@@ -703,21 +715,15 @@ static int cifs_ci_hash(struct dentry *d
return 0;
}

-static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
- struct qstr *b)
+static int cifs_ci_compare(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
+ struct nls_table *codepage = CIFS_SB(inode->i_sb)->local_nls;

- if ((a->len == b->len) &&
- (nls_strnicmp(codepage, a->name, b->name, a->len) == 0)) {
- /*
- * To preserve case, don't let an existing negative dentry's
- * case take precedence. If a is not a negative dentry, this
- * should have no side effects
- */
- memcpy((void *)a->name, b->name, a->len);
+ if ((name->len == len) &&
+ (nls_strnicmp(codepage, name->name, str, len) == 0))
return 0;
- }
return 1;
}

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/dcache.c 2010-11-09 23:22:49.000000000 +1100
@@ -1433,7 +1433,9 @@ struct dentry * __d_lookup(struct dentry
*/
qstr = &dentry->d_name;
if (parent->d_op && parent->d_op->d_compare) {
- if (parent->d_op->d_compare(parent, qstr, name))
+ if (parent->d_op->d_compare(parent,
+ dentry, dentry->d_inode,
+ qstr->len, qstr->name, name))
goto next;
} else {
if (qstr->len != len)
Index: linux-2.6/fs/fat/namei_msdos.c
===================================================================
--- linux-2.6.orig/fs/fat/namei_msdos.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/fat/namei_msdos.c 2010-11-09 23:22:49.000000000 +1100
@@ -164,16 +164,19 @@ static int msdos_hash(struct dentry *den
* Compare two msdos names. If either of the names are invalid,
* we fall back to doing the standard name comparison.
*/
-static int msdos_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int msdos_cmp(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str,
+ const struct qstr *name)
{
- struct fat_mount_options *options = &MSDOS_SB(dentry->d_sb)->options;
+ struct fat_mount_options *options = &MSDOS_SB(parent->d_sb)->options;
unsigned char a_msdos_name[MSDOS_NAME], b_msdos_name[MSDOS_NAME];
int error;

- error = msdos_format_name(a->name, a->len, a_msdos_name, options);
+ error = msdos_format_name(name->name, name->len, a_msdos_name, options);
if (error)
goto old_compare;
- error = msdos_format_name(b->name, b->len, b_msdos_name, options);
+ error = msdos_format_name(str, len, b_msdos_name, options);
if (error)
goto old_compare;
error = memcmp(a_msdos_name, b_msdos_name, MSDOS_NAME);
@@ -182,8 +185,8 @@ static int msdos_cmp(struct dentry *dent

old_compare:
error = 1;
- if (a->len == b->len)
- error = memcmp(a->name, b->name, a->len);
+ if (name->len == len)
+ error = memcmp(name->name, str, len);
goto out;
}

Index: linux-2.6/fs/fat/namei_vfat.c
===================================================================
--- linux-2.6.orig/fs/fat/namei_vfat.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/fat/namei_vfat.c 2010-11-09 23:22:49.000000000 +1100
@@ -85,15 +85,18 @@ static int vfat_revalidate_ci(struct den
}

/* returns the length of a struct qstr, ignoring trailing dots */
-static unsigned int vfat_striptail_len(struct qstr *qstr)
+static unsigned int __vfat_striptail_len(unsigned int len, const char *name)
{
- unsigned int len = qstr->len;
-
- while (len && qstr->name[len - 1] == '.')
+ while (len && name[len - 1] == '.')
len--;
return len;
}

+static unsigned int vfat_striptail_len(const struct qstr *qstr)
+{
+ return __vfat_striptail_len(qstr->len, qstr->name);
+}
+
/*
* Compute the hash for the vfat name corresponding to the dentry.
* Note: if the name is invalid, we leave the hash code unchanged so
@@ -133,16 +136,18 @@ static int vfat_hashi(struct dentry *den
/*
* Case insensitive compare of two vfat names.
*/
-static int vfat_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int vfat_cmpi(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- struct nls_table *t = MSDOS_SB(dentry->d_inode->i_sb)->nls_io;
+ struct nls_table *t = MSDOS_SB(inode->i_sb)->nls_io;
unsigned int alen, blen;

/* A filename cannot end in '.' or we treat it like it has none */
- alen = vfat_striptail_len(a);
- blen = vfat_striptail_len(b);
+ alen = vfat_striptail_len(name);
+ blen = __vfat_striptail_len(len, str);
if (alen == blen) {
- if (nls_strnicmp(t, a->name, b->name, alen) == 0)
+ if (nls_strnicmp(t, name->name, str, alen) == 0)
return 0;
}
return 1;
@@ -151,15 +156,17 @@ static int vfat_cmpi(struct dentry *dent
/*
* Case sensitive compare of two vfat names.
*/
-static int vfat_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int vfat_cmp(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
unsigned int alen, blen;

/* A filename cannot end in '.' or we treat it like it has none */
- alen = vfat_striptail_len(a);
- blen = vfat_striptail_len(b);
+ alen = vfat_striptail_len(name);
+ blen = __vfat_striptail_len(len, str);
if (alen == blen) {
- if (strncmp(a->name, b->name, alen) == 0)
+ if (strncmp(name->name, str, alen) == 0)
return 0;
}
return 1;
@@ -780,6 +787,12 @@ static int vfat_create(struct inode *dir
struct timespec ts;
int err;

+ /*
+ * To preserve case, don't let an existing negative dentry's case
+ * take precedence.
+ */
+ memcpy((void *)dentry->d_name.name, nd->last.name, dentry->d_name.len);
+
lock_super(sb);

ts = CURRENT_TIME_SEC;
Index: linux-2.6/fs/isofs/inode.c
===================================================================
--- linux-2.6.orig/fs/isofs/inode.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/isofs/inode.c 2010-11-09 23:22:50.000000000 +1100
@@ -28,14 +28,22 @@

static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
static int isofs_hash(struct dentry *parent, struct qstr *qstr);
-static int isofs_dentry_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b);
-static int isofs_dentry_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b);
+static int isofs_dentry_cmpi(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);
+static int isofs_dentry_cmp(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);

#ifdef CONFIG_JOLIET
static int isofs_hashi_ms(struct dentry *parent, struct qstr *qstr);
static int isofs_hash_ms(struct dentry *parent, struct qstr *qstr);
-static int isofs_dentry_cmpi_ms(struct dentry *dentry, struct qstr *a, struct qstr *b);
-static int isofs_dentry_cmp_ms(struct dentry *dentry, struct qstr *a, struct qstr *b);
+static int isofs_dentry_cmpi_ms(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);
+static int isofs_dentry_cmp_ms(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);
#endif

static void isofs_put_super(struct super_block *sb)
@@ -213,49 +221,31 @@ isofs_hashi_common(struct dentry *dentry
}

/*
- * Case insensitive compare of two isofs names.
+ * Compare of two isofs names.
*/
-static int isofs_dentry_cmpi_common(struct dentry *dentry, struct qstr *a,
- struct qstr *b, int ms)
+static int isofs_dentry_cmp_common(
+ unsigned int len, const char *str,
+ const struct qstr *name, int ms, int ci)
{
int alen, blen;

/* A filename cannot end in '.' or we treat it like it has none */
- alen = a->len;
- blen = b->len;
+ alen = name->len;
+ blen = len;
if (ms) {
- while (alen && a->name[alen-1] == '.')
+ while (alen && name->name[alen-1] == '.')
alen--;
- while (blen && b->name[blen-1] == '.')
+ while (blen && str[blen-1] == '.')
blen--;
}
if (alen == blen) {
- if (strnicmp(a->name, b->name, alen) == 0)
- return 0;
- }
- return 1;
-}
-
-/*
- * Case sensitive compare of two isofs names.
- */
-static int isofs_dentry_cmp_common(struct dentry *dentry, struct qstr *a,
- struct qstr *b, int ms)
-{
- int alen, blen;
-
- /* A filename cannot end in '.' or we treat it like it has none */
- alen = a->len;
- blen = b->len;
- if (ms) {
- while (alen && a->name[alen-1] == '.')
- alen--;
- while (blen && b->name[blen-1] == '.')
- blen--;
- }
- if (alen == blen) {
- if (strncmp(a->name, b->name, alen) == 0)
- return 0;
+ if (ci) {
+ if (strnicmp(name->name, str, alen) == 0)
+ return 0;
+ } else {
+ if (strncmp(name->name, str, alen) == 0)
+ return 0;
+ }
}
return 1;
}
@@ -273,15 +263,19 @@ isofs_hashi(struct dentry *dentry, struc
}

static int
-isofs_dentry_cmp(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmp(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return isofs_dentry_cmp_common(dentry, a, b, 0);
+ return isofs_dentry_cmp_common(len, str, name, 0, 0);
}

static int
-isofs_dentry_cmpi(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmpi(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return isofs_dentry_cmpi_common(dentry, a, b, 0);
+ return isofs_dentry_cmp_common(len, str, name, 0, 1);
}

#ifdef CONFIG_JOLIET
@@ -298,15 +292,19 @@ isofs_hashi_ms(struct dentry *dentry, st
}

static int
-isofs_dentry_cmp_ms(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmp_ms(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return isofs_dentry_cmp_common(dentry, a, b, 1);
+ return isofs_dentry_cmp_common(len, str, name, 1, 0);
}

static int
-isofs_dentry_cmpi_ms(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmpi_ms(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return isofs_dentry_cmpi_common(dentry, a, b, 1);
+ return isofs_dentry_cmp_common(len, str, name, 1, 1);
}
#endif

Index: linux-2.6/fs/proc/proc_sysctl.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_sysctl.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/proc/proc_sysctl.c 2010-11-09 22:11:10.000000000 +1100
@@ -397,15 +397,15 @@ static int proc_sys_delete(const struct
return !!PROC_I(dentry->d_inode)->sysctl->unregistering;
}

-static int proc_sys_compare(struct dentry *dir, struct qstr *qstr,
- struct qstr *name)
+static int proc_sys_compare(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- struct dentry *dentry = container_of(qstr, struct dentry, d_name);
- if (qstr->len != name->len)
+ if (name->len != len)
return 1;
- if (memcmp(qstr->name, name->name, name->len))
+ if (memcmp(name->name, str, len))
return 1;
- return !sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl);
+ return !sysctl_is_seen(PROC_I(inode)->sysctl);
}

static const struct dentry_operations proc_sys_dentry_operations = {
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/include/linux/dcache.h 2010-11-09 23:22:49.000000000 +1100
@@ -134,7 +134,9 @@ enum dentry_d_lock_class
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash)(struct dentry *, struct qstr *);
- int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+ int (*d_compare)(const struct dentry *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
int (*d_delete)(const struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
@@ -145,12 +147,8 @@ struct dentry_operations {
* Locking rules for dentry_operations callbacks are to be found in
* Documentation/filesystems/Locking. Keep it updated!
*
- * the dentry parameter passed to d_hash and d_compare is the parent
- * directory of the entries to be compared. It is used in case these
- * functions need any directory specific information for determining
- * equivalency classes. Using the dentry itself might not work, as it
- * might be a negative dentry which has no information associated with
- * it.
+ * Further descriptions are found in Documentation/filesystems/vfs.txt.
+ * Keep it updated too!
*/

/* d_flags entries */
Index: linux-2.6/fs/affs/namei.c
===================================================================
--- linux-2.6.orig/fs/affs/namei.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/affs/namei.c 2010-11-09 23:22:49.000000000 +1100
@@ -14,10 +14,14 @@ typedef int (*toupper_t)(int);

static int affs_toupper(int ch);
static int affs_hash_dentry(struct dentry *, struct qstr *);
-static int affs_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+static int affs_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);
static int affs_intl_toupper(int ch);
static int affs_intl_hash_dentry(struct dentry *, struct qstr *);
-static int affs_intl_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+static int affs_intl_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);

const struct dentry_operations affs_dentry_operations = {
.d_hash = affs_hash_dentry,
@@ -88,29 +92,29 @@ affs_intl_hash_dentry(struct dentry *den
return __affs_hash_dentry(dentry, qstr, affs_intl_toupper);
}

-static inline int
-__affs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b, toupper_t toupper)
+static inline int __affs_compare_dentry(unsigned int len,
+ const char *str, const struct qstr *name, toupper_t toupper)
{
- const u8 *aname = a->name;
- const u8 *bname = b->name;
- int len;
+ const u8 *aname = str;
+ const u8 *bname = name->name;

- /* 'a' is the qstr of an already existing dentry, so the name
- * must be valid. 'b' must be validated first.
+ /*
+ * 'str' is the name of an already existing dentry, so the name
+ * must be valid. 'name' must be validated first.
*/

- if (affs_check_name(b->name,b->len))
+ if (affs_check_name(name->name, name->len))
return 1;

- /* If the names are longer than the allowed 30 chars,
+ /*
+ * If the names are longer than the allowed 30 chars,
* the excess is ignored, so their length may differ.
*/
- len = a->len;
if (len >= 30) {
- if (b->len < 30)
+ if (name->len < 30)
return 1;
len = 30;
- } else if (len != b->len)
+ } else if (len != name->len)
return 1;

for (; len > 0; len--)
@@ -121,14 +125,18 @@ __affs_compare_dentry(struct dentry *den
}

static int
-affs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+affs_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return __affs_compare_dentry(dentry, a, b, affs_toupper);
+ return __affs_compare_dentry(len, str, name, affs_toupper);
}
static int
-affs_intl_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+affs_intl_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- return __affs_compare_dentry(dentry, a, b, affs_intl_toupper);
+ return __affs_compare_dentry(len, str, name, affs_intl_toupper);
}

/*
Index: linux-2.6/fs/hfs/hfs_fs.h
===================================================================
--- linux-2.6.orig/fs/hfs/hfs_fs.h 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/hfs/hfs_fs.h 2010-11-09 23:22:49.000000000 +1100
@@ -216,7 +216,9 @@ extern const struct dentry_operations hf
extern int hfs_hash_dentry(struct dentry *, struct qstr *);
extern int hfs_strcmp(const unsigned char *, unsigned int,
const unsigned char *, unsigned int);
-extern int hfs_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+extern int hfs_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);

/* trans.c */
extern void hfs_asc2mac(struct super_block *, struct hfs_name *, struct qstr *);
Index: linux-2.6/fs/hfs/string.c
===================================================================
--- linux-2.6.orig/fs/hfs/string.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/hfs/string.c 2010-11-09 23:22:49.000000000 +1100
@@ -92,21 +92,21 @@ int hfs_strcmp(const unsigned char *s1,
* Test for equality of two strings in the HFS filename character ordering.
* return 1 on failure and 0 on success
*/
-int hfs_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2)
+int hfs_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
const unsigned char *n1, *n2;
- int len;

- len = s1->len;
if (len >= HFS_NAMELEN) {
- if (s2->len < HFS_NAMELEN)
+ if (name->len < HFS_NAMELEN)
return 1;
len = HFS_NAMELEN;
- } else if (len != s2->len)
+ } else if (len != name->len)
return 1;

- n1 = s1->name;
- n2 = s2->name;
+ n1 = str;
+ n2 = name->name;
while (len--) {
if (caseorder[*n1++] != caseorder[*n2++])
return 1;
Index: linux-2.6/fs/hfsplus/hfsplus_fs.h
===================================================================
--- linux-2.6.orig/fs/hfsplus/hfsplus_fs.h 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/hfsplus/hfsplus_fs.h 2010-11-09 23:22:50.000000000 +1100
@@ -380,7 +380,9 @@ int hfsplus_strcmp(const struct hfsplus_
int hfsplus_uni2asc(struct super_block *, const struct hfsplus_unistr *, char *, int *);
int hfsplus_asc2uni(struct super_block *, struct hfsplus_unistr *, const char *, int);
int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str);
-int hfsplus_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2);
+int hfsplus_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name);

/* wrapper.c */
int hfsplus_read_wrapper(struct super_block *);
Index: linux-2.6/fs/hfsplus/unicode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/unicode.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/hfsplus/unicode.c 2010-11-09 23:22:50.000000000 +1100
@@ -363,9 +363,11 @@ int hfsplus_hash_dentry(struct dentry *d
* Composed unicode characters are decomposed and case-folding is performed
* if the appropriate bits are (un)set on the superblock.
*/
-int hfsplus_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2)
+int hfsplus_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- struct super_block *sb = dentry->d_sb;
+ struct super_block *sb = parent->d_sb;
int casefold, decompose, size;
int dsize1, dsize2, len1, len2;
const u16 *dstr1, *dstr2;
@@ -375,10 +377,10 @@ int hfsplus_compare_dentry(struct dentry

casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
- astr1 = s1->name;
- len1 = s1->len;
- astr2 = s2->name;
- len2 = s2->len;
+ astr1 = str;
+ len1 = len;
+ astr2 = name->name;
+ len2 = name->len;
dsize1 = dsize2 = 0;
dstr1 = dstr2 = NULL;

Index: linux-2.6/fs/hpfs/dentry.c
===================================================================
--- linux-2.6.orig/fs/hpfs/dentry.c 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/hpfs/dentry.c 2010-11-09 23:22:49.000000000 +1100
@@ -34,19 +34,24 @@ static int hpfs_hash_dentry(struct dentr
return 0;
}

-static int hpfs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int hpfs_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- unsigned al=a->len;
- unsigned bl=b->len;
- hpfs_adjust_length(a->name, &al);
+ unsigned al = len;
+ unsigned bl = name->len;
+
+ hpfs_adjust_length(str, &al);
/*hpfs_adjust_length(b->name, &bl);*/
- /* 'a' is the qstr of an already existing dentry, so the name
- * must be valid. 'b' must be validated first.
+
+ /*
+ * 'str' is the name of an already existing dentry, so the name
+ * must be valid. 'name' must be validated first.
*/

- if (hpfs_chk_name(b->name, &bl))
+ if (hpfs_chk_name(name->name, &bl))
return 1;
- if (hpfs_compare_names(dentry->d_sb, a->name, al, b->name, bl, 0))
+ if (hpfs_compare_names(parent->d_sb, str, al, name->name, bl, 0))
return 1;
return 0;
}
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/jfs/namei.c 2010-11-09 23:22:50.000000000 +1100
@@ -18,6 +18,7 @@
*/

#include <linux/fs.h>
+#include <linux/namei.h>
#include <linux/ctype.h>
#include <linux/quotaops.h>
#include <linux/exportfs.h>
@@ -1586,32 +1587,60 @@ static int jfs_ci_hash(struct dentry *di
return 0;
}

-static int jfs_ci_compare(struct dentry *dir, struct qstr *a, struct qstr *b)
+static int jfs_ci_compare(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
int i, result = 1;

- if (a->len != b->len)
+ if (len != name->len)
goto out;
- for (i=0; i < a->len; i++) {
- if (tolower(a->name[i]) != tolower(b->name[i]))
+ for (i=0; i < len; i++) {
+ if (tolower(str[i]) != tolower(name->name[i]))
goto out;
}
result = 0;
+out:
+ return result;
+}

+static int jfs_ci_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
/*
- * We want creates to preserve case. A negative dentry, a, that
- * has a different case than b may cause a new entry to be created
- * with the wrong case. Since we can't tell if a comes from a negative
- * dentry, we blindly replace it with b. This should be harmless if
- * a is not a negative dentry.
+ * This is not negative dentry. Always valid.
+ *
+ * Note, rename() to existing directory entry will have ->d_inode,
+ * and will use existing name which isn't specified name by user.
+ *
+ * We may be able to drop this positive dentry here. But dropping
+ * positive dentry isn't good idea. So it's unsupported like
+ * rename("filename", "FILENAME") for now.
*/
- memcpy((unsigned char *)a->name, b->name, a->len);
-out:
- return result;
+ if (dentry->d_inode)
+ return 1;
+
+ /*
+ * This may be nfsd (or something), anyway, we can't see the
+ * intent of this. So, since this can be for creation, drop it.
+ */
+ if (!nd)
+ return 0;
+
+ /*
+ * Drop the negative dentry, in order to make sure to use the
+ * case sensitive name which is specified by user if this is
+ * for creation.
+ */
+ if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) {
+ if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
+ return 0;
+ }
+ return 1;
}

const struct dentry_operations jfs_ci_dentry_operations =
{
.d_hash = jfs_ci_hash,
.d_compare = jfs_ci_compare,
+ .d_revalidate = jfs_ci_revalidate,
};
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/fs/ncpfs/dir.c 2010-11-09 23:22:49.000000000 +1100
@@ -76,7 +76,9 @@ const struct inode_operations ncp_dir_in
*/
static int ncp_lookup_validate(struct dentry *, struct nameidata *);
static int ncp_hash_dentry(struct dentry *, struct qstr *);
-static int ncp_compare_dentry (struct dentry *, struct qstr *, struct qstr *);
+static int ncp_compare_dentry(const struct dentry *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
static int ncp_delete_dentry(const struct dentry *);

static const struct dentry_operations ncp_dentry_operations =
@@ -114,10 +116,10 @@ static inline int ncp_preserve_entry_cas

#define ncp_preserve_case(i) (ncp_namespace(i) != NW_NS_DOS)

-static inline int ncp_case_sensitive(struct dentry *dentry)
+static inline int ncp_case_sensitive(const struct inode *inode)
{
#ifdef CONFIG_NCPFS_NFS_NS
- return ncp_namespace(dentry->d_inode) == NW_NS_NFS;
+ return ncp_namespace(inode) == NW_NS_NFS;
#else
return 0;
#endif /* CONFIG_NCPFS_NFS_NS */
@@ -130,12 +132,15 @@ static inline int ncp_case_sensitive(str
static int
ncp_hash_dentry(struct dentry *dentry, struct qstr *this)
{
- if (!ncp_case_sensitive(dentry)) {
+ struct inode *inode = dentry->d_inode;
+
+ if (!ncp_case_sensitive(inode)) {
+ struct super_block *sb = dentry->d_sb;
struct nls_table *t;
unsigned long hash;
int i;

- t = NCP_IO_TABLE(dentry);
+ t = NCP_IO_TABLE(sb);
hash = init_name_hash();
for (i=0; i<this->len ; i++)
hash = partial_name_hash(ncp_tolower(t, this->name[i]),
@@ -146,15 +151,19 @@ ncp_hash_dentry(struct dentry *dentry, s
}

static int
-ncp_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+ncp_compare_dentry(const struct dentry *parent,
+ const struct dentry *dentry, const struct inode *inode,
+ unsigned int len, const char *str, const struct qstr *name)
{
- if (a->len != b->len)
+ struct super_block *sb = dentry->d_sb;
+
+ if (len != name->len)
return 1;

- if (ncp_case_sensitive(dentry))
- return strncmp(a->name, b->name, a->len);
+ if (ncp_case_sensitive(inode))
+ return strncmp(str, name->name, len);

- return ncp_strnicmp(NCP_IO_TABLE(dentry), a->name, b->name, a->len);
+ return ncp_strnicmp(NCP_IO_TABLE(sb), str, name->name, len);
}

/*
Index: linux-2.6/fs/ncpfs/ncplib_kernel.h
===================================================================
--- linux-2.6.orig/fs/ncpfs/ncplib_kernel.h 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/fs/ncpfs/ncplib_kernel.h 2010-11-09 22:11:10.000000000 +1100
@@ -135,7 +135,7 @@ int ncp__vol2io(struct ncp_server *, uns
const unsigned char *, unsigned int, int);

#define NCP_ESC ':'
-#define NCP_IO_TABLE(dentry) (NCP_SERVER((dentry)->d_inode)->nls_io)
+#define NCP_IO_TABLE(sb) (NCP_SBP(sb)->nls_io)
#define ncp_tolower(t, c) nls_tolower(t, c)
#define ncp_toupper(t, c) nls_toupper(t, c)
#define ncp_strnicmp(t, s1, s2, len) \
@@ -150,15 +150,15 @@ int ncp__io2vol(unsigned char *, unsigne
int ncp__vol2io(unsigned char *, unsigned int *,
const unsigned char *, unsigned int, int);

-#define NCP_IO_TABLE(dentry) NULL
+#define NCP_IO_TABLE(sb) NULL
#define ncp_tolower(t, c) tolower(c)
#define ncp_toupper(t, c) toupper(c)
#define ncp_io2vol(S,m,i,n,k,U) ncp__io2vol(m,i,n,k,U)
#define ncp_vol2io(S,m,i,n,k,U) ncp__vol2io(m,i,n,k,U)


-static inline int ncp_strnicmp(struct nls_table *t, const unsigned char *s1,
- const unsigned char *s2, int len)
+static inline int ncp_strnicmp(const struct nls_table *t,
+ const unsigned char *s1, const unsigned char *s2, int len)
{
while (len--) {
if (tolower(*s1++) != tolower(*s2++))
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt 2010-11-09 23:22:49.000000000 +1100
@@ -842,7 +842,9 @@ the VFS uses a default. As of kernel 2.6
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash)(struct dentry *, struct qstr *);
- int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+ int (*d_compare)(const struct dentry *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
int (*d_delete)(const struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
@@ -854,9 +856,26 @@ struct dentry_operations {
dcache. Most filesystems leave this as NULL, because all their
dentries in the dcache are valid

- d_hash: called when the VFS adds a dentry to the hash table
+ d_hash: called when the VFS adds a dentry to the hash table. The first
+ dentry passed to d_hash is the parent directory that the name is
+ to be hashed into.

- d_compare: called when a dentry should be compared with another
+ d_compare: called to compare a dentry name with a given name. The first
+ dentry is the parent of the dentry to be compared, the second is
+ the dentry itself. inode, len, and name string are properties of
+ the dentry to be compared. qstr is the name to compare it with.
+
+ Must be constant and idempotent, and should not take locks if
+ possible, and should not store into the dentry or inodes.
+ Should not dereference pointers outside the dentry or inodes without
+ lots of care (eg. d_parent, d_inode shouldn't be used).
+
+ However, our vfsmount is pinned, and RCU held, so the dentries and
+ inodes won't disappear, neither will our sb or filesystem module.
+ ->i_sb and ->d_sb may be used.
+
+ It is a tricky calling convention because it needs to be called under
+ "rcu-walk", ie. without any locks or references on things.

d_delete: called when the last reference to a dentry is dropped and the
dcache is deciding whether or not to cache it. Return 1 to delete
Index: linux-2.6/fs/isofs/namei.c
===================================================================
--- linux-2.6.orig/fs/isofs/namei.c 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/fs/isofs/namei.c 2010-11-09 22:11:10.000000000 +1100
@@ -37,7 +37,8 @@ isofs_cmp(struct dentry *dentry, const c

qstr.name = compare;
qstr.len = dlen;
- return dentry->d_op->d_compare(dentry, &dentry->d_name, &qstr);
+ return dentry->d_op->d_compare(NULL, NULL, NULL,
+ dentry->d_name.len, dentry->d_name.name, &qstr);
}

/*
Index: linux-2.6/include/linux/ncp_fs.h
===================================================================
--- linux-2.6.orig/include/linux/ncp_fs.h 2010-11-09 22:10:58.000000000 +1100
+++ linux-2.6/include/linux/ncp_fs.h 2010-11-09 22:11:10.000000000 +1100
@@ -184,13 +184,13 @@ struct ncp_entry_info {
__u8 file_handle[6];
};

-static inline struct ncp_server *NCP_SBP(struct super_block *sb)
+static inline struct ncp_server *NCP_SBP(const struct super_block *sb)
{
return sb->s_fs_info;
}

#define NCP_SERVER(inode) NCP_SBP((inode)->i_sb)
-static inline struct ncp_inode_info *NCP_FINFO(struct inode *inode)
+static inline struct ncp_inode_info *NCP_FINFO(const struct inode *inode)
{
return container_of(inode, struct ncp_inode_info, vfs_inode);
}
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking 2010-11-09 22:10:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/Locking 2010-11-09 23:22:49.000000000 +1100
@@ -11,7 +11,9 @@ be able to use diff(1).
prototypes:
int (*d_revalidate)(struct dentry *, int);
int (*d_hash) (struct dentry *, struct qstr *);
- int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+ int (*d_compare)(const struct dentry *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting 2010-11-09 22:11:10.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting 2010-11-09 23:22:49.000000000 +1100
@@ -337,3 +337,10 @@ must be done in the RCU callback.
unreferenced dentries, and is now only called when the dentry refcount goes to
0. Even on 0 refcount transition, it must be able to tolerate being called 0, 1,
or more times (eg. constant, idempotent).
+
+---
+[mandatory]
+
+ .d_compare() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.

2010-11-09 13:03:26

by Nick Piggin

[permalink] [raw]
Subject: [patch 6/6] fs: d_hash change for rcu-walk

Change d_hash so it may be called from lock-free RCU lookups.
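
For illustration only (not part of this patch), a case-folding d_hash
under the new prototype might look like this; "examplefs" is a made-up
filesystem and the tolower() folding is just an example policy:

static int examplefs_d_hash(const struct dentry *parent,
                const struct inode *inode, struct qstr *qstr)
{
        unsigned long hash;
        unsigned int i;

        /*
         * Only the parent dentry, its inode and the qstr are available;
         * no locks are taken and only qstr->hash is written, so this is
         * safe to call from rcu-walk.
         */
        hash = init_name_hash();
        for (i = 0; i < qstr->len; i++)
                hash = partial_name_hash(tolower(qstr->name[i]), hash);
        qstr->hash = end_name_hash(hash);
        return 0;
}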

Signed-off-by: Nick Piggin <[email protected]>

---
Documentation/filesystems/Locking | 5 +++--
Documentation/filesystems/porting | 7 +++++++
Documentation/filesystems/vfs.txt | 8 ++++++--
fs/adfs/dir.c | 3 ++-
fs/affs/namei.c | 20 ++++++++++++--------
fs/cifs/dir.c | 5 +++--
fs/cifs/readdir.c | 2 +-
fs/dcache.c | 2 +-
fs/ecryptfs/inode.c | 4 ++--
fs/fat/namei_msdos.c | 3 ++-
fs/fat/namei_vfat.c | 8 +++++---
fs/gfs2/dentry.c | 3 ++-
fs/hfs/hfs_fs.h | 3 ++-
fs/hfs/string.c | 3 ++-
fs/hfsplus/hfsplus_fs.h | 3 ++-
fs/hfsplus/unicode.c | 3 ++-
fs/hpfs/dentry.c | 3 ++-
fs/isofs/inode.c | 28 ++++++++++++++++++----------
fs/jfs/namei.c | 3 ++-
fs/namei.c | 5 +++--
fs/ncpfs/dir.c | 10 +++++-----
fs/sysv/namei.c | 3 ++-
include/linux/dcache.h | 3 ++-
23 files changed, 88 insertions(+), 49 deletions(-)

Index: linux-2.6/fs/adfs/dir.c
===================================================================
--- linux-2.6.orig/fs/adfs/dir.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/adfs/dir.c 2010-11-09 23:22:37.000000000 +1100
@@ -201,7 +201,8 @@ const struct file_operations adfs_dir_op
};

static int
-adfs_hash(struct dentry *parent, struct qstr *qstr)
+adfs_hash(const struct dentry *parent, const struct inode *inode,
+ struct qstr *qstr)
{
const unsigned int name_len = ADFS_SB(parent->d_sb)->s_namelen;
const unsigned char *name;
Index: linux-2.6/fs/affs/namei.c
===================================================================
--- linux-2.6.orig/fs/affs/namei.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/affs/namei.c 2010-11-09 23:22:37.000000000 +1100
@@ -13,12 +13,14 @@
typedef int (*toupper_t)(int);

static int affs_toupper(int ch);
-static int affs_hash_dentry(struct dentry *, struct qstr *);
+static int affs_hash_dentry(const struct dentry *,
+ const struct inode *, struct qstr *);
static int affs_compare_dentry(const struct dentry *parent,
const struct dentry *dentry, const struct inode *inode,
unsigned int len, const char *str, const struct qstr *name);
static int affs_intl_toupper(int ch);
-static int affs_intl_hash_dentry(struct dentry *, struct qstr *);
+static int affs_intl_hash_dentry(const struct dentry *,
+ const struct inode *, struct qstr *);
static int affs_intl_compare_dentry(const struct dentry *parent,
const struct dentry *dentry, const struct inode *inode,
unsigned int len, const char *str, const struct qstr *name);
@@ -62,13 +64,13 @@ affs_get_toupper(struct super_block *sb)
* Note: the dentry argument is the parent dentry.
*/
static inline int
-__affs_hash_dentry(struct dentry *dentry, struct qstr *qstr, toupper_t toupper)
+__affs_hash_dentry(struct qstr *qstr, toupper_t toupper)
{
const u8 *name = qstr->name;
unsigned long hash;
int i;

- i = affs_check_name(qstr->name,qstr->len);
+ i = affs_check_name(qstr->name, qstr->len);
if (i)
return i;

@@ -82,14 +84,16 @@ __affs_hash_dentry(struct dentry *dentry
}

static int
-affs_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+affs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
- return __affs_hash_dentry(dentry, qstr, affs_toupper);
+ return __affs_hash_dentry(qstr, affs_toupper);
}
static int
-affs_intl_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+affs_intl_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
- return __affs_hash_dentry(dentry, qstr, affs_intl_toupper);
+ return __affs_hash_dentry(qstr, affs_intl_toupper);
}

static inline int __affs_compare_dentry(unsigned int len,
Index: linux-2.6/fs/cifs/dir.c
===================================================================
--- linux-2.6.orig/fs/cifs/dir.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/cifs/dir.c 2010-11-09 23:22:37.000000000 +1100
@@ -700,9 +700,10 @@ const struct dentry_operations cifs_dent
/* d_delete: cifs_d_delete, */ /* not needed except for debugging */
};

-static int cifs_ci_hash(struct dentry *dentry, struct qstr *q)
+static int cifs_ci_hash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *q)
{
- struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
+ struct nls_table *codepage = CIFS_SB(dentry->d_sb)->local_nls;
unsigned long hash;
int i;

Index: linux-2.6/fs/fat/namei_msdos.c
===================================================================
--- linux-2.6.orig/fs/fat/namei_msdos.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/fat/namei_msdos.c 2010-11-09 23:22:37.000000000 +1100
@@ -148,7 +148,8 @@ static int msdos_find(struct inode *dir,
* that the existing dentry can be used. The msdos fs routines will
* return ENOENT or EINVAL as appropriate.
*/
-static int msdos_hash(struct dentry *dentry, struct qstr *qstr)
+static int msdos_hash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
struct fat_mount_options *options = &MSDOS_SB(dentry->d_sb)->options;
unsigned char msdos_name[MSDOS_NAME];
Index: linux-2.6/fs/fat/namei_vfat.c
===================================================================
--- linux-2.6.orig/fs/fat/namei_vfat.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/fat/namei_vfat.c 2010-11-09 23:22:37.000000000 +1100
@@ -103,7 +103,8 @@ static unsigned int vfat_striptail_len(c
* that the existing dentry can be used. The vfat fs routines will
* return ENOENT or EINVAL as appropriate.
*/
-static int vfat_hash(struct dentry *dentry, struct qstr *qstr)
+static int vfat_hash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
qstr->hash = full_name_hash(qstr->name, vfat_striptail_len(qstr));
return 0;
@@ -115,9 +116,10 @@ static int vfat_hash(struct dentry *dent
* that the existing dentry can be used. The vfat fs routines will
* return ENOENT or EINVAL as appropriate.
*/
-static int vfat_hashi(struct dentry *dentry, struct qstr *qstr)
+static int vfat_hashi(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
- struct nls_table *t = MSDOS_SB(dentry->d_inode->i_sb)->nls_io;
+ struct nls_table *t = MSDOS_SB(dentry->d_sb)->nls_io;
const unsigned char *name;
unsigned int len;
unsigned long hash;
Index: linux-2.6/fs/gfs2/dentry.c
===================================================================
--- linux-2.6.orig/fs/gfs2/dentry.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/gfs2/dentry.c 2010-11-09 23:22:37.000000000 +1100
@@ -100,7 +100,8 @@ static int gfs2_drevalidate(struct dentr
return 0;
}

-static int gfs2_dhash(struct dentry *dentry, struct qstr *str)
+static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *str)
{
str->hash = gfs2_disk_hash(str->name, str->len);
return 0;
Index: linux-2.6/fs/hfs/hfs_fs.h
===================================================================
--- linux-2.6.orig/fs/hfs/hfs_fs.h 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/hfs/hfs_fs.h 2010-11-09 23:22:37.000000000 +1100
@@ -213,7 +213,8 @@ extern int hfs_part_find(struct super_bl
/* string.c */
extern const struct dentry_operations hfs_dentry_operations;

-extern int hfs_hash_dentry(struct dentry *, struct qstr *);
+extern int hfs_hash_dentry(const struct dentry *, const struct inode *,
+ struct qstr *);
extern int hfs_strcmp(const unsigned char *, unsigned int,
const unsigned char *, unsigned int);
extern int hfs_compare_dentry(const struct dentry *parent,
Index: linux-2.6/fs/hfs/string.c
===================================================================
--- linux-2.6.orig/fs/hfs/string.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/hfs/string.c 2010-11-09 23:22:37.000000000 +1100
@@ -51,7 +51,8 @@ static unsigned char caseorder[256] = {
/*
* Hash a string to an integer in a case-independent way
*/
-int hfs_hash_dentry(struct dentry *dentry, struct qstr *this)
+int hfs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *this)
{
const unsigned char *name = this->name;
unsigned int hash, len = this->len;
Index: linux-2.6/fs/hfsplus/hfsplus_fs.h
===================================================================
--- linux-2.6.orig/fs/hfsplus/hfsplus_fs.h 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/hfsplus/hfsplus_fs.h 2010-11-09 23:22:37.000000000 +1100
@@ -379,7 +379,8 @@ int hfsplus_strcasecmp(const struct hfsp
int hfsplus_strcmp(const struct hfsplus_unistr *, const struct hfsplus_unistr *);
int hfsplus_uni2asc(struct super_block *, const struct hfsplus_unistr *, char *, int *);
int hfsplus_asc2uni(struct super_block *, struct hfsplus_unistr *, const char *, int);
-int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str);
+int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *str);
int hfsplus_compare_dentry(const struct dentry *parent,
const struct dentry *dentry, const struct inode *inode,
unsigned int len, const char *str, const struct qstr *name);
Index: linux-2.6/fs/hfsplus/unicode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/unicode.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/hfsplus/unicode.c 2010-11-09 23:22:37.000000000 +1100
@@ -320,7 +320,8 @@ int hfsplus_asc2uni(struct super_block *
* Composed unicode characters are decomposed and case-folding is performed
* if the appropriate bits are (un)set on the superblock.
*/
-int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str)
+int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *str)
{
struct super_block *sb = dentry->d_sb;
const char *astr;
Index: linux-2.6/fs/hpfs/dentry.c
===================================================================
--- linux-2.6.orig/fs/hpfs/dentry.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/hpfs/dentry.c 2010-11-09 23:22:37.000000000 +1100
@@ -12,7 +12,8 @@
* Note: the dentry argument is the parent dentry.
*/

-static int hpfs_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+static int hpfs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
unsigned long hash;
int i;
Index: linux-2.6/fs/isofs/inode.c
===================================================================
--- linux-2.6.orig/fs/isofs/inode.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/isofs/inode.c 2010-11-09 23:22:37.000000000 +1100
@@ -26,8 +26,10 @@

#define BEQUIET

-static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
-static int isofs_hash(struct dentry *parent, struct qstr *qstr);
+static int isofs_hashi(const struct dentry *parent, const struct inode *inode,
+ struct qstr *qstr);
+static int isofs_hash(const struct dentry *parent, const struct inode *inode,
+ struct qstr *qstr);
static int isofs_dentry_cmpi(const struct dentry *parent,
const struct dentry *dentry, const struct inode *inode,
unsigned int len, const char *str, const struct qstr *name);
@@ -36,8 +38,10 @@ static int isofs_dentry_cmp(const struct
unsigned int len, const char *str, const struct qstr *name);

#ifdef CONFIG_JOLIET
-static int isofs_hashi_ms(struct dentry *parent, struct qstr *qstr);
-static int isofs_hash_ms(struct dentry *parent, struct qstr *qstr);
+static int isofs_hashi_ms(const struct dentry *parent, const struct inode *inode,
+ struct qstr *qstr);
+static int isofs_hash_ms(const struct dentry *parent, const struct inode *inode,
+ struct qstr *qstr);
static int isofs_dentry_cmpi_ms(const struct dentry *parent,
const struct dentry *dentry, const struct inode *inode,
unsigned int len, const char *str, const struct qstr *name);
@@ -175,7 +179,7 @@ struct iso9660_options{
* Compute the hash for the isofs name corresponding to the dentry.
*/
static int
-isofs_hash_common(struct dentry *dentry, struct qstr *qstr, int ms)
+isofs_hash_common(const struct dentry *dentry, struct qstr *qstr, int ms)
{
const char *name;
int len;
@@ -196,7 +200,7 @@ isofs_hash_common(struct dentry *dentry,
* Compute the hash for the isofs name corresponding to the dentry.
*/
static int
-isofs_hashi_common(struct dentry *dentry, struct qstr *qstr, int ms)
+isofs_hashi_common(const struct dentry *dentry, struct qstr *qstr, int ms)
{
const char *name;
int len;
@@ -251,13 +255,15 @@ static int isofs_dentry_cmp_common(
}

static int
-isofs_hash(struct dentry *dentry, struct qstr *qstr)
+isofs_hash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
return isofs_hash_common(dentry, qstr, 0);
}

static int
-isofs_hashi(struct dentry *dentry, struct qstr *qstr)
+isofs_hashi(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
return isofs_hashi_common(dentry, qstr, 0);
}
@@ -280,13 +286,15 @@ isofs_dentry_cmpi(const struct dentry *p

#ifdef CONFIG_JOLIET
static int
-isofs_hash_ms(struct dentry *dentry, struct qstr *qstr)
+isofs_hash_ms(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
return isofs_hash_common(dentry, qstr, 1);
}

static int
-isofs_hashi_ms(struct dentry *dentry, struct qstr *qstr)
+isofs_hashi_ms(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
return isofs_hashi_common(dentry, qstr, 1);
}
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/jfs/namei.c 2010-11-09 23:22:37.000000000 +1100
@@ -1574,7 +1574,8 @@ const struct file_operations jfs_dir_ope
.llseek = generic_file_llseek,
};

-static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
+static int jfs_ci_hash(const struct dentry *dir, const struct inode *inode,
+ struct qstr *this)
{
unsigned long hash;
int i;
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/ncpfs/dir.c 2010-11-09 23:22:37.000000000 +1100
@@ -75,7 +75,8 @@ const struct inode_operations ncp_dir_in
* Dentry operations routines
*/
static int ncp_lookup_validate(struct dentry *, struct nameidata *);
-static int ncp_hash_dentry(struct dentry *, struct qstr *);
+static int ncp_hash_dentry(const struct dentry *, const struct inode *,
+ struct qstr *);
static int ncp_compare_dentry(const struct dentry *,
const struct dentry *, const struct inode *,
unsigned int, const char *, const struct qstr *);
@@ -130,10 +131,9 @@ static inline int ncp_case_sensitive(con
* is case-sensitive.
*/
static int
-ncp_hash_dentry(struct dentry *dentry, struct qstr *this)
+ncp_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *this)
{
- struct inode *inode = dentry->d_inode;
-
if (!ncp_case_sensitive(inode)) {
struct super_block *sb = dentry->d_sb;
struct nls_table *t;
@@ -602,7 +602,7 @@ ncp_fill_cache(struct file *filp, void *
qname.hash = full_name_hash(qname.name, qname.len);

if (dentry->d_op && dentry->d_op->d_hash)
- if (dentry->d_op->d_hash(dentry, &qname) != 0)
+ if (dentry->d_op->d_hash(dentry, dentry->d_inode, &qname) != 0)
goto end_advance;

newdent = d_lookup(dentry, &qname);
Index: linux-2.6/fs/sysv/namei.c
===================================================================
--- linux-2.6.orig/fs/sysv/namei.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/sysv/namei.c 2010-11-09 23:22:37.000000000 +1100
@@ -27,7 +27,8 @@ static int add_nondir(struct dentry *den
return err;
}

-static int sysv_hash(struct dentry *dentry, struct qstr *qstr)
+static int sysv_hash(const struct dentry *dentry, const struct inode *inode,
+ struct qstr *qstr)
{
/* Truncate the name in place, avoids having to define a compare
function. */
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/include/linux/dcache.h 2010-11-09 23:22:37.000000000 +1100
@@ -133,7 +133,8 @@ enum dentry_d_lock_class

struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
- int (*d_hash)(struct dentry *, struct qstr *);
+ int (*d_hash)(const struct dentry *, const struct inode *,
+ struct qstr *);
int (*d_compare)(const struct dentry *,
const struct dentry *, const struct inode *,
unsigned int, const char *, const struct qstr *);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/namei.c 2010-11-09 23:22:37.000000000 +1100
@@ -731,7 +731,8 @@ static int do_lookup(struct nameidata *n
* to use its own hash..
*/
if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
- int err = nd->path.dentry->d_op->d_hash(nd->path.dentry, name);
+ int err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+ nd->path.dentry->d_inode, name);
if (err < 0)
return err;
}
@@ -1134,7 +1135,7 @@ static struct dentry *__lookup_hash(stru
* to use its own hash..
*/
if (base->d_op && base->d_op->d_hash) {
- err = base->d_op->d_hash(base, name);
+ err = base->d_op->d_hash(base, inode, name);
dentry = ERR_PTR(err);
if (err < 0)
goto out;
Index: linux-2.6/fs/cifs/readdir.c
===================================================================
--- linux-2.6.orig/fs/cifs/readdir.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/cifs/readdir.c 2010-11-09 23:22:37.000000000 +1100
@@ -79,7 +79,7 @@ cifs_readdir_lookup(struct dentry *paren
cFYI(1, "For %s", name->name);

if (parent->d_op && parent->d_op->d_hash)
- parent->d_op->d_hash(parent, name);
+ parent->d_op->d_hash(parent, parent->d_inode, name);
else
name->hash = full_name_hash(name->name, name->len);

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/dcache.c 2010-11-09 23:22:37.000000000 +1100
@@ -1474,7 +1474,7 @@ struct dentry *d_hash_and_lookup(struct
*/
name->hash = full_name_hash(name->name, name->len);
if (dir->d_op && dir->d_op->d_hash) {
- if (dir->d_op->d_hash(dir, name) < 0)
+ if (dir->d_op->d_hash(dir, dir->d_inode, name) < 0)
goto out;
}
dentry = d_lookup(dir, name);
Index: linux-2.6/fs/ecryptfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/inode.c 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/fs/ecryptfs/inode.c 2010-11-09 23:22:37.000000000 +1100
@@ -454,7 +454,7 @@ static struct dentry *ecryptfs_lookup(st
lower_name.hash = ecryptfs_dentry->d_name.hash;
if (lower_dir_dentry->d_op && lower_dir_dentry->d_op->d_hash) {
rc = lower_dir_dentry->d_op->d_hash(lower_dir_dentry,
- &lower_name);
+ lower_dir_dentry->d_inode, &lower_name);
if (rc < 0)
goto out_d_drop;
}
@@ -489,7 +489,7 @@ static struct dentry *ecryptfs_lookup(st
lower_name.hash = full_name_hash(lower_name.name, lower_name.len);
if (lower_dir_dentry->d_op && lower_dir_dentry->d_op->d_hash) {
rc = lower_dir_dentry->d_op->d_hash(lower_dir_dentry,
- &lower_name);
+ lower_dir_dentry->d_inode, &lower_name);
if (rc < 0)
goto out_d_drop;
}
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/Documentation/filesystems/Locking 2010-11-09 23:22:37.000000000 +1100
@@ -10,7 +10,8 @@ be able to use diff(1).
--------------------------- dentry_operations --------------------------
prototypes:
int (*d_revalidate)(struct dentry *, int);
- int (*d_hash) (struct dentry *, struct qstr *);
+ int (*d_hash)(const struct dentry *, const struct inode *,
+ struct qstr *);
int (*d_compare)(const struct dentry *,
const struct dentry *, const struct inode *,
unsigned int, const char *, const struct qstr *);
@@ -23,7 +24,7 @@ be able to use diff(1).
none have BKL
dcache_lock rename_lock ->d_lock may block
d_revalidate: no no no yes
-d_hash no no no yes
+d_hash no no no no
d_compare: no yes no no
d_delete: yes no yes no
d_release: no no no yes
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt 2010-11-09 23:22:37.000000000 +1100
@@ -841,7 +841,8 @@ the VFS uses a default. As of kernel 2.6

struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
- int (*d_hash)(struct dentry *, struct qstr *);
+ int (*d_hash)(const struct dentry *, const struct inode *,
+ struct qstr *);
int (*d_compare)(const struct dentry *,
const struct dentry *, const struct inode *,
unsigned int, const char *, const struct qstr *);
@@ -858,7 +859,10 @@ struct dentry_operations {

d_hash: called when the VFS adds a dentry to the hash table. The first
dentry passed to d_hash is the parent directory that the name is
- to be hashed into.
+ to be hashed into. The inode is the dentry's inode.
+
+ Same locking and synchronisation rules as d_compare regarding
+ what is safe to dereference etc.

d_compare: called to compare a dentry name with a given name. The first
dentry is the parent of the dentry to be compared, the second is
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting 2010-11-09 23:22:35.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting 2010-11-09 23:22:37.000000000 +1100
@@ -344,3 +344,10 @@ or more times (eg. constant, idempotent)
.d_compare() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.
+
+---
+[mandatory]
+
+ .d_hash() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.

2010-11-09 14:19:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

Nick Piggin <[email protected]> writes:

> So here is the inode RCU code. It's obviously not worth doing until the
> actual rcu-walk path walking is in, but I'd like to get opinions on it.
> It would be nice to merge it in Al's tree at some point, though.

I read the patch. It was quite monotonous (I guess that's a good thing).
But it wasn't clear to me why you added the INIT_LIST_HEAD()s
everywhere. Is this for stopping parallel walkers?

Ok there's a comment in the doc: "VFS expects it to be initialized"
Is that really true today? I don't think the old code does that.

Other than that it seems straightforward.

Reviewed-by: Andi Kleen <[email protected]>

-Andi

--
[email protected] -- Speaking for myself only.

2010-11-09 16:03:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 9, 2010 at 4:46 AM, Nick Piggin <[email protected]> wrote:
> So here is the inode RCU code. It's obviously not worth doing until the
> actual rcu-walk path walking is in, but I'd like to get opinions on it.
> It would be nice to merge it in Al's tree at some point, though.

Remind me why it wasn't sufficient to just use SLAB_DESTROY_BY_RCU?

Especially if we still lock things for the actual (few) inode list
operations, the added complexity of actually freeing _individual_
inodes by RCU seems to be a bad thing.

The only thing we care about is the pathname walk - there are no other
inode operations that are common enough to worry about. And the only
thing _that_ needs is the ability to look at the inode under RCU, and
SLAB_DESTROY_BY_RCU should be entirely sufficient for that.

But we had some discussion about this long ago, and I may have
forgotten some of the context.

Linus

2010-11-09 16:21:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 08:02:38AM -0800, Linus Torvalds wrote:
> Remind me why it wasn't sufficient to just use SLAB_DESTROY_BY_RCU?

Dave sent a patch for it, which looks much better to me. Nick thinks
it doesn't work for his store free path walk, but I haven't seen an
explanation why exactly.

> The only thing we care about is the pathname walk - there are no other
> inode operations that are common enough to worry about. And the only
> thing _that_ needs is the ability to look at the inode under RCU, and
> SLAB_DESTROY_BY_RCU should be entirely sufficient for that.

It might be worth it for inode lookup. While it's shadowed by the
dcache hash we still hit it a lot, especially for NFS serving.

2010-11-09 16:21:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

Le mardi 09 novembre 2010 à 08:02 -0800, Linus Torvalds a écrit :
> On Tue, Nov 9, 2010 at 4:46 AM, Nick Piggin <[email protected]> wrote:
> > So here is the inode RCU code. It's obviously not worth doing until the
> > actual rcu-walk path walking is in, but I'd like to get opinions on it.
> > It would be nice to merge it in Al's tree at some point, though.
>
> Remind me why it wasn't sufficient to just use SLAB_DESTROY_BY_RCU?
>
> Especially if we still lock things for the actual (few) inode list
> operations, the added complexity of actually freeing _individual_
> inodes by RCU seems to be a bad thing.
>
> The only thing we care about is the pathname walk - there are no other
> inode operations that are common enough to worry about. And the only
> thing _that_ needs is the ability to look at the inode under RCU, and
> SLAB_DESTROY_BY_RCU should be entirely sufficient for that.
>
> But we had some discussion about this long ago, and I may have
> forgotten some of the context.
>

David Chinner sent a patch using SLAB_DESTROY_BY_RCU

You can see problems using this fancy thing :

- Need to use slab ctor() to not overwrite some sensitive fields of
reused inodes.
(spinlock, next pointer)

- Fancy algo to detect an inode moved from one chain to another. Lookups
should be able to detect and restart their loop.

- After a match, need to get a stable reference on inode (lock), then
recheck the keys to make sure the target inode is the right one.

http://www.gossamer-threads.com/lists/linux/kernel/1298710

A bit tricky, but doable IMHO, especially if covering only the main
lookup. The seldom-used ones can still get a (spin)lock.
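
To make those three points concrete, the lookup side ends up looking roughly
like the sketch below. This is illustrative only, not taken from David's
patch: find_inode_rcu(), the key checks and the locking details are made up,
and the second point (noticing that the walk wandered onto another chain) is
glossed over. The essential part is the lock-then-recheck step and the
restart when a slab object turns out to have been reused:

static struct inode *find_inode_rcu(struct super_block *sb,
				    struct hlist_head *head,
				    unsigned long ino)
{
	struct inode *inode;
	struct hlist_node *node;

repeat:
	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		if (inode->i_sb != sb || inode->i_ino != ino)
			continue;
		/*
		 * The object may be freed and reused at any time, so a
		 * match means nothing until we hold a lock and
		 * re-check the keys.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_sb != sb || inode->i_ino != ino ||
		    hlist_unhashed(&inode->i_hash)) {
			/* reused or rehashed under us: start over */
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			goto repeat;
		}
		__iget(inode);	/* take a reference; real icache locking differs */
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;
}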


2010-11-09 16:24:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 3/6] fs: dcache documentation cleanup

The patch looks good and I'm fine with sending it off to Linus ASAP.
But the commit message should not contain misc blurbs like the one
below.

2010-11-09 16:25:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 4/6] fs: d_delete change

The patch looks fine to me, and I'm also fine with merging it ASAP.
But the patch subject and commit message are not very descriptive.

2010-11-09 16:25:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 5/6] fs: d_compare change for rcu-walk

On Wed, Nov 10, 2010 at 12:02:35AM +1100, Nick Piggin wrote:
> cifs and jfs contain the non-trivial changes, where they no longer
> overwrite dentry name in their d_compare(), but rather use the
> d_revalidate() method for ensuring case preservation in the presence
> of negative dentries, taken from fatfs.

Please split these out to separate patches, and make sure you get
a review and testing from the jfs and cifs maintainers. I don't think
this is .37 material any more.

2010-11-09 17:09:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
>
> You can see problems using this fancy thing :
>
> - Need to use slab ctor() to not overwrite some sensitive fields of
> reused inodes.
> (spinlock, next pointer)

Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
cannot initialize some fields in the allocation path, because they may
end up being still used while allocating a new (well, re-used) entry.

However, I think that in the long run we pretty much _have_ to do that
anyway, because the "free each inode separately with RCU" is a real
overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
go that way. And I think SLAB_DESTROY_BY_RCU is the "normal" way to do
these kinds of things anyway, so I actually think it's "simpler", if
only because it's the common pattern.

(Put another way: it might not be less code, and it might have its own
subtle issues, but they are _shared_ subtle issues with all the other
SLAB_DESTROY_BY_RCU users, so we hopefully have a better understanding
of them)

> - Fancy algo to detect an inode moved from one chain to another. Lookups
> should be able to detect and restart their loop.

So this is where I think we should just use locks unless we have hard
numbers to say that being clever is worth it.

I do realize that some loads look up inodes directly, but at the same
time I really think that we should absolutely target the whole "RCU
path lookup" first. And that one has no inode lookup at all, it's just
a dentry->d_inode pointer dereference.

So let's not mix in NFSD loads into the discussion yet - it's a
separate thing, and if we want to make that whole code use RCU later,
that's fine. But let's really keep it "later", because it's not
_nearly_ as important as the path walking.

> - After a match, need to get a stable reference on inode (lock), then
> recheck the keys to make sure the target inode is the right one.

Again, this is only an issue for non-dentry lookup. For the dentry
case, we know that if the dentry still exists, then the inode still
exists. So we don't need to check a stable inode pointer if we just
verify the stability of the dentry - and we'll have to do that anyway
obviously.

So I really think that the dentry lookup is the thing that should
primarily drive this. And that will not in any way preclude us from
looking at the non-dentry case _later_, and worrying about the details
there at some later date.

In other words: let's bite off the complexity in small chunks. Let's
keep the inode lock approach for now for the actual inode lists and
hash lookups. I think they are almost entirely independent issues from
the dentry path.

Linus

2010-11-09 17:15:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> Again, this is only an issue for non-dentry lookup. For the dentry
> case, we know that if the dentry still exists, then the inode still
> exists. So we don't need to check a stable inode pointer if we just
> verify the stability of the dentry - and we'll have to do that anyway
> obviously.

If the dentry still exists we have a reference on the inode and never
call into the inode hash.

> In other words: let's bite off the complexity in small chunks. Let's
> keep the inode lock approach for now for the actual inode lists and
> hash lookups. I think they are almost entirely independent issues from
> the dentry path.

I'm definitely in favour of splitting things into small chunks. I
don't particularly care how we do it. inode_lock scaling seems the
most simple bit to me, and even that turned out to be a massive
amount of work to do properly.

Doing the dentry_lock splitup last starts to look more and more
interesting given how messy inode_lock is, though.

2010-11-09 21:36:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 03:19:02PM +0100, Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
>
> > So here is the inode RCU code. It's obviously not worth doing until the
> > actual rcu-walk path walking is in, but I'd like to get opinions on it.
> > It would be nice to merge it in Al's tree at some point, though.
>
> I read the patch. It was quite monotonous (I guess that's a good thing)
> But it wasn't clear to me why you added the INIT_LIST_HEAD()s
> everywhere. Is this for stopping parallel walkers?
>
> Ok there's a comment in the doc: "VFS expects it to be initialized"
> Is that really true today? I don't think the old code does that.

It is in the inode_init_once pile, so yes it has to be returned
to the allocator initialized.

>
> Other than that it seems straight forward.
>
> Reviewed-by: Andi Kleen <[email protected]>

Thanks,
Nick

2010-11-09 21:44:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 08:02:38AM -0800, Linus Torvalds wrote:
> On Tue, Nov 9, 2010 at 4:46 AM, Nick Piggin <[email protected]> wrote:
> > So here is the inode RCU code. It's obviously not worth doing until the
> > actual rcu-walk path walking is in, but I'd like to get opinions on it.
> > It would be nice to merge it in Al's tree at some point, though.
>
> Remind me why it wasn't sufficient to just use SLAB_DESTROY_BY_RCU?
>
> Especially if we still lock things for the actual (few) inode list
> operations, the added complexity of actually freeing _individual_
> inodes by RCU seems to be a bad thing.

Oh yes, the list operations themselves do not need RCU at all. Not
even rapid hash lookups if you have fine grained hash locking --
it just doesn't matter.


> The only thing we care about is the pathname walk - there are no other
> inode operations that are common enough to worry about. And the only
> thing _that_ needs is the ability to look at the inode under RCU, and
> SLAB_DESTROY_BY_RCU should be entirely sufficient for that.
>
> But we had some discussion about this long ago, and I may have
> forgotten some of the context.

Yes, dentry->inode resolution is *the* only inode lookup we really
care about, definitely.

The big problem with SLAB_DESTROY_BY_RCU there is that we don't
have a reference on the dentry, so inode can go away. And we don't
have anything to speculatively pin the inode and recheck it, because
we're not allowed to store into it.

When we *do* have an inode, it's really hairy inspecting the inode and
calling i_ops etc. with no idea if it is there, in the middle of
being freed, or allocated for someone else.

What the plan was is to do the plain RCU approach, and get everybody
happy with the concept of rcu-walk path walking (which is already
enough complexity). In the _very_ worst case, if DESTROY_BY_RCU
must be implemented and there are no other viable alternatives, we can use
i_lock for that. It takes away some benefit, but there would still
be much fewer atomics and less contention on common paths.

But I don't think RCU is actually going to hurt noticeably. Definitely
we need numbers before committing to such complexity off the bat.

2010-11-09 21:48:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 11:21:31AM -0500, Christoph Hellwig wrote:
> On Tue, Nov 09, 2010 at 08:02:38AM -0800, Linus Torvalds wrote:
> > Remind me why it wasn't sufficient to just use SLAB_DESTROY_BY_RCU?
>
> Dave sent a patch for it, which looks much better to me.

It's horribly complex and why even do any of that for icache by
itself? What is going on there really? Do you think RCU inodes
are really important for inode hash lookup?

> Nick thinks
> it doesn't work for his store free path walk, but I haven't seen an
> explanation why exactly.

I posted explanations several times, and the code is there to
look at. I haven't seen an explanation from you about how it should
be used with rcu-walk, and why we shouldn't go with the simpler
RCU approach first and then evaluate an incremental patch.

>
> > The only thing we care about is the pathname walk - there are no other
> > inode operations that are common enough to worry about. And the only
> > thing _that_ needs is the ability to look at the inode under RCU, and
> > SLAB_DESTROY_BY_RCU should be entirely sufficient for that.
>
> It might be worth it for inode lookup. While it's shadowed by the
> dcache hash we still hit it a lot, especially for NFS serving.

No, you would never make inode RCU for that case alone. That's
crazy. _If_ NFS serving really hits that bottleneck, then fine
grained hash locking makes it go away. So it's not a justification
for anything.

2010-11-09 21:55:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 12:15:22PM -0500, Christoph Hellwig wrote:
> On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> > Again, this is only an issue for non-dentry lookup. For the dentry
> > case, we know that if the dentry still exists, then the inode still
> > exists. So we don't need to check a stable inode pointer if we just
> > verify the stability of the dentry - and we'll have to do that anyway
> > obviously.
>
> If the dentry still exists we have a reference on the inode and never
> call into the inode hash.

That would be nice. Unfortunately I don't know if the dentry still
exists.


> > In other words: let's bite off the complexity in small chunks. Let's
> > keep the inode lock approach for now for the actual inode lists and
> > hash lookups. I think they are almost entirely independent issues from
> > the dentry path.
>
> I'm defintively in favour of splitting things into small chunks. I
> don't particularly care how we do it. inode_lock scaling seems the
> most simple bit to me, and even that turned out to be a massive
> amount of work to do properly.

That is because the locking model was made much more complex and less
regular than it needed to be. If you have a model where i_lock ==
inode_lock for the context of that inode, it's simple and restructuring
the code can happen _in parallel_ rather than with dependencies on the
inode locking.

The several inode data structures are *trivial*. Simple structures,
trivial operations to insert/remove/lookup. The *hard* part is locking
the actual inode itself and ensuring it is not subject to unwanted
concurrency. If you make i_lock exclude everything without exception,
then it's not hard to verify it.

> Doing the dentry_lock splitup last starts to look more and more
> interesting given how messy inode_lock is, though.

It's not actually, if the locking is done right. And it will need to
be this time because yes it is more complex than icache, so an ad hoc
approach won't work.

2010-11-09 22:05:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
> >
> > You can see problems using this fancy thing :
> >
> > - Need to use slab ctor() to not overwrite some sensitive fields of
> > reused inodes.
> > (spinlock, next pointer)
>
> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
> cannot initialize some fields in the allocation path, because they may
> end up being still used while allocating a new (well, re-used) entry.
>
> However, I think that in the long run we pretty much _have_ to do that
> anyway, because the "free each inode separately with RCU" is a real
> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
> go that way.

This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
is going to be *much* heavier in creat/unlink (so that 10-20% cost would
look more like a few %), and any real workload is going to have much
less intensive pattern.

Much of the performance hit comes from draining the allocator's queues
and going into slow paths there; the remainder is due to cache coldness
of the new objects being allocated, and to RCU queueing.

But if you allocate and free in larger batches, RCU and non-RCU have
the same biggest problem of draining slab queues and getting cache
cold objects.

So that remaining few % I suspect goes down to a few points of a %.

Anyway, I actually have tried to measure a regression on slightly more
realistic inode heavy workloads like fs_mark, and couldn't measure any.
On the flip side, I could measure huge improvements with rcu-walk, so
I think it is a valid tradeoff, with an eye to having a few fallback
plans if we really need them.


> And I think SLAB_DESTROY_BY_RCU is the "normal" way to do
> these kinds of things anyway, so I actually think it's "simpler", if
> only because it's the common pattern.
>
> (Put another way: it might not be less code, and it might have its own
> subtle issues, but they are _shared_ subtle issues with all the other
> SLAB_DESTROY_BY_RCU users, so we hopefully have a better understanding
> of them)
>
> > - Fancy algo to detect an inode moved from one chain to another. Lookups
> > should be able to detect and restart their loop.
>
> So this is where I think we should just use locks unless we have hard
> numbers to say that being clever is worth it.

Yeah, it's really not a big deal.


> > - After a match, need to get a stable reference on inode (lock), then
> > recheck the keys to make sure the target inode is the right one.
>
> Again, this is only an issue for non-dentry lookup. For the dentry
> case, we know that if the dentry still exists, then the inode still
> exists. So we don't need to check a stable inode pointer if we just
> verify the stability of the dentry - and we'll have to do that anyway
> obviously.

With rcu-walk, we don't know that a dentry still exists -- we can't
store anything to it to pin it. I come back and verify after the
fact that it hasn't changed or gone away, and that enables us to know
that the inode has not gone away too.

But in the intermediate stage of doing access verification on the inode,
it really sucks if it suddenly comes back as something else. Not saying
it is impossible, but the traditional easy model of "lock, re-check" for
DESTROY_BY_RCU does not work with rcu-walk. Believe me I've looked at
doing it.

2010-11-09 22:06:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/6] fs: dcache documentation cleanup

On Tue, Nov 09, 2010 at 11:24:15AM -0500, Christoph Hellwig wrote:
> The patch looks good and I'm fine with sending it off to Linus ASAP.
> But the commit message should not contain misc blurbs like the one
> below.

It's not part of the changelog.

2010-11-09 22:08:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 4/6] fs: d_delete change

On Tue, Nov 09, 2010 at 11:25:16AM -0500, Christoph Hellwig wrote:
> The patch looks fine to me, and I'm also fine with merging it ASAP.
> But the patch subject and commit message are not very descriptive.

How is the commit message not descriptive? The first sentence
summarises exactly what the change does. The last says why it
is required. In the middle are some details.

2010-11-10 01:48:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 5/6] fs: d_compare change for rcu-walk

On Tue, Nov 09, 2010 at 11:25:55AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 10, 2010 at 12:02:35AM +1100, Nick Piggin wrote:
> > cifs and jfs contain the non-trivial changes, where they no longer
> > overwrite dentry name in their d_compare(), but rather use the
> > d_revalidate() method for ensuring case preservation in the presence
> > of negative dentries, taken from fatfs.
>
> Please split these out to separate patches, and make sure you get

Yeah, good suggestion.

> a review and testing from the jfs and cifs maintainers. I don't think
> this is .37 material any more.

We'll see. It's pretty trivial to test really, and the method is reused
from fatfs.

2010-11-10 14:47:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

> It is in the inode_init_once pile, so yes it has to be returned
> to the allocator initialized.

The kmem_cache_create()s in the file systems don't pass constructors
today. So I don't see how this could work reliably.

-Andi

2010-11-10 16:28:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 3/6] fs: dcache documentation cleanup

On Wed, Nov 10, 2010 at 09:06:00AM +1100, Nick Piggin wrote:
> On Tue, Nov 09, 2010 at 11:24:15AM -0500, Christoph Hellwig wrote:
> > The patch looks good and I'm fine with sending it off to Linus ASAP.
> > But the commit message should not contain misc blurbs like the one
> > below.
>
> It's not part of the changelog.

It is when applying your series using git-am; that's the only reason I
noticed anyway.

2010-11-10 16:32:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 4/6] fs: d_delete change

On Wed, Nov 10, 2010 at 09:08:33AM +1100, Nick Piggin wrote:
> On Tue, Nov 09, 2010 at 11:25:16AM -0500, Christoph Hellwig wrote:
> > The patch looks fine to me, and I'm also fine with merging it ASAP.
> > But the patch subject and commit message are not very descriptive.
>
> How is the commit message not descriptive? The first sentence
> summarises exactly what the change does. The last says why it
> is required. In the middle are some details.

"foo change" is about as useless as a subject could be.

"fs: idempotent d_delete" from your old tree was much better.

As far as the commit message is concerned I think the most important
bit is that we do not call it from prune_one_dentry anymore, which is
the thing that might matter to any complex filesystem maintainer
looking at the changelog.

The other thing I didn't like was the introductory blurb, but from
reading the answer to the previous comment it seems like that wasn't
intentional anyway.

2010-11-11 00:27:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 4/6] fs: d_delete change

On Wed, Nov 10, 2010 at 11:32:25AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 10, 2010 at 09:08:33AM +1100, Nick Piggin wrote:
> > On Tue, Nov 09, 2010 at 11:25:16AM -0500, Christoph Hellwig wrote:
> > > The patch looks fine to me, and I'm also fine with merging it ASAP.
> > > But the patch subject and commit message are not very descriptive.
> >
> > How is the commit message not descriptive? The first sentence
> > summarises exactly what the change does. The last says why it
> > is required. In the middle are some details.
>
> foo change is about as useless as a subject could be.
>
> "fs: idempotent d_delete" from your old tree was much better.

It's not only idempotent, though, so I thought it was better to
change it. Seeing as the change could not be summarised in a
subject line, at least the ambiguous subject would draw the reader
to look at the changelog.


> As far as the commit message is concerned I think the most important
> bit is that we do not call it from prune_one_dentry anymore, which is
> the things that might matter to any complex filesystem maintainer
> looking at the changelog.

See: first sentence of the changelog.


> The other things I didn't like was the introductionary blurb, but from
> reading the answer to the previous comment is seems like that wsn't
> intentional anyway.

Right, I'll switch to a different way of commenting that git-am
does not pick up.

2010-11-11 04:27:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Wed, Nov 10, 2010 at 03:47:21PM +0100, Andi Kleen wrote:
> > It is in the inode_init_once pile, so yes it has to be returned
> > to the allocator initialized.
>
> The kmem_cache_create()s in the file systems don't pass constructors
> today. So I don't see how this could work reliably.

They do; grep filesystems for inode_init_once.
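
For reference, the pattern in question looks roughly like this (sketched
from ext2's style; other filesystems are equivalent, details vary):

static struct kmem_cache *ext2_inode_cachep;

static void init_once(void *foo)
{
	struct ext2_inode_info *ei = foo;

	/* filesystem-private locks/rwsems get initialised here ... */
	inode_init_once(&ei->vfs_inode);
}

static int init_inodecache(void)
{
	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
				sizeof(struct ext2_inode_info), 0,
				SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
				init_once);
	if (ext2_inode_cachep == NULL)
		return -ENOMEM;
	return 0;
}

The constructor only runs when a slab page is first populated, not on
every allocation, so fields like the list heads keep whatever state they
were left in at free time -- which is why they have to go back to the
allocator initialized.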

2010-11-11 22:08:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 4/6] fs: d_delete change

On Wed, Nov 10, 2010 at 4:27 PM, Nick Piggin <[email protected]> wrote:
>
> Right, I'll switch to a different way of commenting that git-am
> does not pick up.

The standard way is to just put it after the "---" thing. So you end
up having the actual kernel changelog at the top (which is, after all,
the summary and should be readable on its own, so making people read
that first makes perfect sense), and then put the extended explanation
that doesn't make sense for the changelog after the "---".

Of course, quite often you would want the commit message to be
exhaustive, so maybe the extended explanation ends up making sense for
the changelog too. But if it's more of a "discussion material" thing
(eg things like "Hi guys, this is _why_ I think this is the right
approach: ...") that way it's clear both to tools and by now to most
developers that the stuff that comes after the "---" is really
extended material.

Linus

2010-11-12 01:24:25

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Wed, Nov 10, 2010 at 9:05 AM, Nick Piggin <[email protected]> wrote:
> On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
>> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
>> >
>> > You can see problems using this fancy thing :
>> >
>> > - Need to use slab ctor() to not overwrite some sensitive fields of
>> > reused inodes.
>> > (spinlock, next pointer)
>>
>> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
>> cannot initialize some fields in the allocation path, because they may
>> end up being still used while allocating a new (well, re-used) entry.
>>
>> However, I think that in the long run we pretty much _have_ to do that
>> anyway, because the "free each inode separately with RCU" is a real
>> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
>> go that way.
>
> This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
> is going to be *much* heavier in creat/unlink (so that 10-20% cost would
> look more like a few %), and any real workload is going to have much
> less intensive pattern.

So to get some more precise numbers: on a new kernel, on a Nehalem-class
CPU, with a creat/unlink busy loop on ramfs (the worst possible case for
inode RCU), inode RCU costs 12% more time.

If we go to ext4 over ramdisk, it's 4.2% slower. Btrfs is 4.3% slower, XFS
is about 4.9% slower.

Remember, this is on a ramdisk that's _hitting the CPU's L3 if not L2_
cache. A real disk, even a fast SSD, is going to do IO far slower.

And also remember that real workloads will not approach creat/unlink busy
loop behaviour of creating and destroying 800K files/s. So even if you were
creating and destroying 80K files per second per CPU, the overall slowdown
will be on the order of 0.4% (but really, we know that very few workloads
even do that much creat/unlink activity, otherwise we would have been
totally bottlenecked on inode_lock long ago).

The next factor is that the slowdown from RCU is reduced if you create and
destroy longer batches of inodes. If you create 1000, then destroy 1000
inodes in a busy loop, then the ramfs regression is reduced to a 4.5%
disadvantage with RCU, and the ext4 disadvantage is down to 1%, because you
lose a lot of your CPU cache advantages anyway.

And the fact is I have not been able to find anything except microbenchmarks
where I can detect any slowdown at all.

And you obviously have seen the actual benefits that come with this -- kernel
time to do path walking in your git workload is 2x faster even with just
a single thread running.

So this is really not a "oh, maybe someone will see 10-20% slowdown", or even
1-2% slowdown. I would even be surprised at a 0.1-0.2% slowdown on a real
workload, but that would be about the order of magnitude I am prepared to live
with. If, in the very unlikely case, we saw a 1-2% type of magnitude, I would start
looking at improvements or ways to do SLAB_RCU.

Are you happy with that?

2010-11-12 04:49:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Thu, Nov 11, 2010 at 5:24 PM, Nick Piggin <[email protected]> wrote:
>
> So this is really not a "oh, maybe someone will see 10-20% slowdown", or even
> 1-2% slowdown.

You ignored my bigger issue: the _normal_ way - and the better way -
to handle these things is with SLAB_DESTROY_BY_RCU.

So what are the advantages of using the inferior approach? I really
don't see why you push the whole "free the damn things individually"
approach. We've had experience with that, and it's resulted in long
RCU queues with memory being held up in queuing etc (hopefully we've
improved on the queuing issues with RCU itself, but..)

So the thing is, SLAB_DESTROY_BY_RCU is just the RightThing(tm). Why fight it?

Linus

2010-11-12 06:02:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Thu, Nov 11, 2010 at 08:48:38PM -0800, Linus Torvalds wrote:
> On Thu, Nov 11, 2010 at 5:24 PM, Nick Piggin <[email protected]> wrote:
> >
> > So this is really not a "oh, maybe someone will see 10-20% slowdown", or even
> > 1-2% slowdown.
>
> You ignored my bigger issue: the _normal_ way - and the better way -
> to handle these thingsis with SLAB_DESTROY_BY_RCU.

Well I tried to answer that in the other threads.

SLAB_DESTROY_BY_RCU is indeed quite natural for a lot of RCU usages,
because even with standard RCU you almost always have the pattern like

rcu_read_lock();
obj = lookup_data_structure(key);
if (obj) {
	lock(obj);
	verify_obj_in_structure(obj, key);
	/* blah... (eg. take refcount) */
}

And in this pattern, SLAB_DESTROY_BY_RCU takes almost zero work.

OK, but rcu-walk doesn't have that. In rcu-walk, we can't take a lock
or a reference on either the dentry _or_ the inode, because the whole
point is to reduce atomics (for single threaded performance), and
stores into shared cachelines along the path (for scalability).

It gets really interesting when you have crazy stuff going on like
inode->i_ops changing from underneath you while you're trying to do
permission lookups, or inode type changing from link to dir to reg
in the middle of the traversal.


> So what are the advantages of using the inferior approach? I really
> don't see why you push the whole "free the damn things individually"
> approach.

I'm not pushing _that_ aspect of it. I'm pushing the "don't go away and
come back as something else" aspect.

Yes it may be _possible_ to do store-free walking SLAB_DESTROY_BY_RCU,
and I have some ideas. But it is hairy. More hairy than rcu-walk, by
quite a long shot.

And the complexity and issues that need to be understood are basically
rcu-walk + *more*. So I want basic rcu-walk to be reviewed and basically
verified first, and we could think about the +more part after that.

Basically SLAB_DESTROY_BY_RCU is not exactly natural for a completely
store-free algorithm.

2010-11-12 06:49:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Fri, Nov 12, 2010 at 05:02:02PM +1100, Nick Piggin wrote:
> On Thu, Nov 11, 2010 at 08:48:38PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 11, 2010 at 5:24 PM, Nick Piggin <[email protected]> wrote:
> > >
> > > So this is really not a "oh, maybe someone will see 10-20% slowdown", or even
> > > 1-2% slowdown.
> >
> > You ignored my bigger issue: the _normal_ way - and the better way -
> > to handle these thingsis with SLAB_DESTROY_BY_RCU.
>
> Well I tried to answer that in the other threads.
>
> SLAB_DESTROY_BY_RCU is indeed quite natural for a lot of RCU usages,
> because even with standard RCU you almost always have the pattern like
>
> rcu_read_lock();
> obj = lookup_data_structure(key);
> if (obj) {
> 	lock(obj);
> 	verify_obj_in_structure(obj, key);
> 	/* blah... (eg. take refcount) */
> }
>
> And in this pattern, SLAB_DESTROY_BY_RCU takes almost zero work.
>
> OK, but rcu-walk doesn't have that. In rcu-walk, we can't take a lock
> or a reference on either the dentry _or_ the inode, because the whole
> point is to reduce atomics (for single threaded performance), and
> stores into shared cachelines along the path (for scalability).
>
> It gets really interesting when you have crazy stuff going on like
> inode->i_ops changing from underneath you while you're trying to do
> permission lookups, or inode type changing from link to dir to reg
> in the middle of the traversal.
>
>
> > So what are the advantages of using the inferior approach? I really
> > don't see why you push the whole "free the damn things individually"
> > approach.
>
> I'm not pushing _that_ aspect of it. I'm pushing the "don't go away and
> come back as something else" aspect.
>
> Yes it may be _possible_ to do store-free walking SLAB_DESTROY_BY_RCU,
> and I have some ideas. But it is hairy. More hairy than rcu-walk, by
> quite a long shot.

So in short, that is my justification. 12% is a worst case regression,
but the demonstration is obviously absurdly worst case, and is merely
there as a "ok, the sky won't fall on anybody's head" upper bound.

In reality, it's likely to be well under 0.1% in any real workload, even
an inode intensive one. So I much prefer to err on the side of less
complexity, to start with. There just isn't much risk of regression
AFAIKS, and much more risk of becoming unmaintainably complex.

2010-11-12 17:34:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Thu, Nov 11, 2010 at 10:49 PM, Nick Piggin <[email protected]> wrote:
>
> In reality, it's likely to be well under 0.1% in any real workload, even
> an inode intensive one. So I much prefer to err on the side of less
> complexity, to start with. There just isn't much risk of regression
> AFAIKS, and much more risk of becoming unmaintainable too complex.

Well, I have to say that if we don't get this lockless path lookup
thing merged in the next merge window (ie 38-rc1), I'm going to be
personally very disappointed (*).

So yes, the "initial complexity" argument is certainly acceptable to
me. It does make me suspect something is wrong, though, because quite
frankly, the actual accesses to the inode during the lockless walk
should be very _very_ controlled anyway. And it's trivial to do a "is
this inode still the same one I started with" with zero locking, by
just checking that "dentry->d_inode" is the same after-the-fact and
checking that the dentry is still hashed. The inode type had better
_NOT_ change if the dentry pointer is still there.

So even if the type or i_ops changes, none of that should matter in
the least. Nobody should _care_. We might get two wildly different
results, but we have a trivial way to check whether the inode was
stable after-the-fact, and just punt if it wasn't. So it really smells
like if this is an issue, there's something wrong going on.
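
In sketch form, the idea is roughly this (purely illustrative;
do_speculative_op() is a placeholder, and real code would need the
appropriate barriers or sequence counts around these loads):

static int walk_one_component(struct dentry *dentry)
{
	struct inode *inode = dentry->d_inode;
	int err;

	if (!inode)
		return -EAGAIN;			/* punt to the locked walk */

	err = do_speculative_op(inode);		/* no stores, no refcounts */

	/* after-the-fact validation: same inode, dentry still live */
	if (d_unhashed(dentry) || dentry->d_inode != inode)
		return -EAGAIN;			/* changed under us: punt */

	return err;
}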

Linus

(*) To the point where if Al and Dave cannot end up agreeing on
something, I'll just instead end up making the executive decision to
screw the eternal nay-saying, and just merging your series. So just a
heads-up guys: healthy discussion is fine, but obstructionism will
_not_ work.

2010-11-12 23:17:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Fri, Nov 12, 2010 at 09:33:11AM -0800, Linus Torvalds wrote:
> On Thu, Nov 11, 2010 at 10:49 PM, Nick Piggin <[email protected]> wrote:
> >
> > In reality, it's likely to be well under 0.1% in any real workload, even
> > an inode intensive one. So I much prefer to err on the side of less
> > complexity, to start with. There just isn't much risk of regression
> > AFAIKS, and much more risk of becoming unmaintainable too complex.
>
> Well, I have to say that if we don't get this lockless path lookup
> thing merged in the next merge window (ir 38-rc1), I'm going to be
> personally very disappointed (*).

I'm trying to piece things together. I'll hopefully be able to post
patches again soon for review.


> So yes, the "initial complexity" argument is certainly acceptable to
> me. It does make me suspect something is wrong, though, because quite
> frankly, the actual accesses to the inode during the lockless walk
> should be very _very_ controlled anyway. And it's trivial to do a "is
> this inode still the same one I started with" with zero locking, by
> just checking that "dentry->d_inode" is the same after-the-fact and
> checking that the dentry is still hashed. The inode type had better
> _NOT_ change if the dentry pointer is still there.
>
> So even if the type or i_ops changes, none of that should matter in
> the least. Nobody should _care_. We might get two wildly different
> results, but we have a trivial way to check whether the inode was
> stable after-the-fact, and just punt if it wasn't. So it really smells
> like if this is an issue, there's something wrong going on.

Yes, you are very right about that: it is actually possible to use
seqlocks and re-checking to verify things after the fact. And
this is why I'm optimistic that we can tackle any and all regressions
that come up.
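
As a rough sketch of that seqlock-style recheck (the d_seq field is
assumed from the rcu-walk series, and both helper names are made up for
illustration):

#include <linux/dcache.h>
#include <linux/fs.h>
#include <linux/seqlock.h>

/* Sample the inode under a seqcount ... */
static struct inode *inode_rcu_peek(struct dentry *dentry, unsigned *seq)
{
	*seq = read_seqcount_begin(&dentry->d_seq);
	return dentry->d_inode;
}

/*
 * ... do the speculative, lock-free work, then verify it afterwards.
 * If the check fails, a real implementation would drop out of rcu-walk
 * and retry the lookup with proper references rather than loop here.
 */
static bool inode_rcu_ok(struct dentry *dentry, unsigned seq)
{
	return !read_seqcount_retry(&dentry->d_seq, seq);
}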

An example of where it can get more complicated:

A filesystem has an ->op function which gets the sb from inode->i_sb,
and then does the container_of thing to get the filesystem-specific
superblock, so it can check flags to determine something (eg. whether
it is case-sensitive or not).

If the inode goes away and i_sb can change, this can oops. We basically
just need to further tighten the rules and further audit everyone. I'm not
saying it can't be done, I'm just saying it's not _totally_ trivial like
the usual DESTROY_BY_RCU pattern, so let's just see what incremental
patches look like.
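
As a sketch of the kind of code in question (myfs_sb_info,
MYFS_CASE_INSENSITIVE and myfs_is_casefolded are all made-up names,
purely for illustration):

#include <linux/fs.h>

struct myfs_sb_info {
	unsigned int flags;
#define MYFS_CASE_INSENSITIVE	0x01
};

static bool myfs_is_casefolded(struct inode *inode)
{
	/* Reach the fs-private superblock info via the inode. */
	struct myfs_sb_info *sbi = inode->i_sb->s_fs_info;

	/*
	 * Under SLAB_DESTROY_BY_RCU the inode may be freed and reused while
	 * this runs, so i_sb (and hence sbi) can point at a different or
	 * dying superblock; without re-validation afterwards, this
	 * dereference can oops or return garbage.
	 */
	return sbi->flags & MYFS_CASE_INSENSITIVE;
}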

I'm glad you agree at this point (and if it does turn out to be much
simpler than I anticipate, then hey that's great, we can just move to
DESTROY_BY_RCU even quicker).

Thanks,
Nick

2010-11-15 01:00:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Fri, Nov 12, 2010 at 12:24:21PM +1100, Nick Piggin wrote:
> On Wed, Nov 10, 2010 at 9:05 AM, Nick Piggin <[email protected]> wrote:
> > On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> >> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
> >> >
> >> > You can see problems using this fancy thing :
> >> >
> >> > - Need to use slab ctor() to not overwrite some sensitive fields of
> >> > reused inodes.
> >> > ?(spinlock, next pointer)
> >>
> >> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
> >> cannot initialize some fields in the allocation path, because they may
> >> end up being still used while allocating a new (well, re-used) entry.
> >>
> >> However, I think that in the long run we pretty much _have_ to do that
> >> anyway, because the "free each inode separately with RCU" is a real
> >> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
> >> go that way.
> >
> > This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
> > is going to be *much* heavier in creat/unlink (so that 10-20% cost would
> > look more like a few %), and any real workload is going to have much
> > less intensive pattern.
>
> So to get some more precise numbers, on a new kernel, and on a nehalem
> class CPU, creat/unlink busy loop on ramfs (worst possible case for inode
> RCU), then inode RCU costs 12% more time.
>
> If we go to ext4 over ramdisk, it's 4.2% slower. Btrfs is 4.3% slower, XFS
> is about 4.9% slower.

That is actually significant, because current XFS performance
using delayed logging for pure metadata operations is not that far
off ramdisk results. Indeed, the simple test:

	int i = 0;

	while (i++ < 1000 * 1000) {
		int fd = open("foo", O_CREAT|O_RDWR, 0777);
		unlink("foo");
		close(fd);
	}

Running 8 instances of the above on XFS, each in their own
directory, on a single sata drive with delayed logging enabled with
my current working XFS tree (includes SLAB_DESTROY_BY_RCU inode
cache and XFS inode cache, and numerous other XFS scalability
enhancements) currently runs at ~250k files/s. It took ~33s for 8 of
those loops above to complete in parallel, and was 100% CPU bound...
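
For reference, a driver for that test could look something like the
sketch below (the 8 instances, per-instance directories and 1M
iterations follow the description above; the driver itself is only
illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static void churn(void)
{
	int i = 0;

	while (i++ < 1000 * 1000) {
		int fd = open("foo", O_CREAT|O_RDWR, 0777);
		unlink("foo");
		close(fd);
	}
}

int main(void)
{
	int n;

	for (n = 0; n < 8; n++) {
		if (fork() == 0) {
			char dir[32];

			/* Each instance works in its own directory. */
			snprintf(dir, sizeof(dir), "dir%d", n);
			mkdir(dir, 0755);
			if (chdir(dir))
				exit(1);
			churn();
			exit(0);
		}
	}
	while (wait(NULL) > 0)
		;	/* reap all the children */
	return 0;
}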

> Remember, this is on a ramdisk that's _hitting the CPU's L3 if not L2_
> cache. A real disk, even a fast SSD, is going to do IO far slower.

The amount of IO done during the above test? A single log write -
one IO. Hence it isn't going to be any faster on a RAM disk, an SSD, a
large RAID array, etc., because it is CPU bound, not IO bound. IOWs,
that 5% difference in CPU usage is significant for XFS regardless of
the storage....

> And also remember that real workloads will not approach creat/unlink busy
> loop behaviour of creating and destroying 800K files/s.

Perhaps not a local workload, but I expect to see things like
fileservers getting hit with these sorts of loads (i.e. hundreds of
thousands of create/unlinks a second). Especially as XFS now has
the journal scalability to make this possible...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-15 04:21:11

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Mon, Nov 15, 2010 at 12:00:27PM +1100, Dave Chinner wrote:
> On Fri, Nov 12, 2010 at 12:24:21PM +1100, Nick Piggin wrote:
> > On Wed, Nov 10, 2010 at 9:05 AM, Nick Piggin <[email protected]> wrote:
> > > On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> > >> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
> > >> >
> > >> > You can see problems using this fancy thing :
> > >> >
> > >> > - Need to use slab ctor() to not overwrite some sensitive fields of
> > >> > reused inodes.
> > >> > ?(spinlock, next pointer)
> > >>
> > >> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
> > >> cannot initialize some fields in the allocation path, because they may
> > >> end up being still used while allocating a new (well, re-used) entry.
> > >>
> > >> However, I think that in the long run we pretty much _have_ to do that
> > >> anyway, because the "free each inode separately with RCU" is a real
> > >> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
> > >> go that way.
> > >
> > > This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
> > > is going to be *much* heavier in creat/unlink (so that 10-20% cost would
> > > look more like a few %), and any real workload is going to have much
> > > less intensive pattern.
> >
> > So to get some more precise numbers, on a new kernel, and on a nehalem
> > class CPU, creat/unlink busy loop on ramfs (worst possible case for inode
> > RCU), then inode RCU costs 12% more time.
> >
> > If we go to ext4 over ramdisk, it's 4.2% slower. Btrfs is 4.3% slower, XFS
> > is about 4.9% slower.
>
> That is actually significant, because current XFS performance
> using delayed logging for pure metadata operations is not that far
> off ramdisk results. Indeed, the simple test:
>
> while (i++ < 1000 * 1000) {
> int fd = open("foo", O_CREAT|O_RDWR, 777);
> unlink("foo");
> close(fd);
> }
>
> Running 8 instances of the above on XFS, each in their own
> directory, on a single sata drive with delayed logging enabled with
> my current working XFS tree (includes SLAB_DESTROY_BY_RCU inode
> cache and XFS inode cache, and numerous other XFS scalability
> enhancements) currently runs at ~250k files/s. It took ~33s for 8 of
> those loops above to complete in parallel, and was 100% CPU bound...

David,

This is 30K inodes per second per CPU, versus nearly 800K per second
number that I measured the 12% slowdown with. About 25x slower. How you
are trying to FUD this as doing anything but confirming my hypothesis, I
don't know and honestly I don't want to know so don't try to tell me.

That you are still at this campaign of negative and destructive crap
baffles me. All the effort you've put into negativity and obstruction,
you could have gone and got some *actual* numbers. But no, you're
obviously more interested in FUD.

I don't know what is funnier, that I keep responding to you, or that you
keep expecting me to reply when you ignore all _my_ comments about your
patches and ignore all the reasons I have given to want to merge things
my way (eg. my response to SLAB_DESTROY_BY_RCU patch where you ignored
all my feedback, and you ignore this entire thread about how and why I
want to approach rcu-walk in the way I do).

But that's it. I have explained my position, offered reasonable answers
to all questions and objections, shown good numbers, and given
strategies that regressions can be solved with. That's all I need to do.

I acknowledge the very small potential for regressions with inode-RCU
for a very small number of users. I also weigh that against complexity
and reviewability, and against the very large speedups for very many
users that rcu-walk can give. I also offered approaches for ways that
future work can resolve any regressions. You ignored all that.

You show me no respect or courtesy and seem to take me as a big joke. So
at this point I'm not interested in your handwaving or opinions. Is that
clear? Until you 1) treat me the way you expect to be treated, and 2)
actually have something constructive, do not cc me. I do not care.

2010-11-16 03:03:00

by Dave Chinner

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Mon, Nov 15, 2010 at 03:21:00PM +1100, Nick Piggin wrote:
> On Mon, Nov 15, 2010 at 12:00:27PM +1100, Dave Chinner wrote:
> > On Fri, Nov 12, 2010 at 12:24:21PM +1100, Nick Piggin wrote:
> > > On Wed, Nov 10, 2010 at 9:05 AM, Nick Piggin <[email protected]> wrote:
> > > > On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> > > >> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <[email protected]> wrote:
> > > >> >
> > > >> > You can see problems using this fancy thing :
> > > >> >
> > > >> > - Need to use slab ctor() to not overwrite some sensitive fields of
> > > >> > reused inodes.
> > > >> > ?(spinlock, next pointer)
> > > >>
> > > >> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
> > > >> cannot initialize some fields in the allocation path, because they may
> > > >> end up being still used while allocating a new (well, re-used) entry.
> > > >>
> > > >> However, I think that in the long run we pretty much _have_ to do that
> > > >> anyway, because the "free each inode separately with RCU" is a real
> > > >> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
> > > >> go that way.
> > > >
> > > > This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
> > > > is going to be *much* heavier in creat/unlink (so that 10-20% cost would
> > > > look more like a few %), and any real workload is going to have much
> > > > less intensive pattern.
> > >
> > > So to get some more precise numbers, on a new kernel, and on a nehalem
> > > class CPU, creat/unlink busy loop on ramfs (worst possible case for inode
> > > RCU), then inode RCU costs 12% more time.
> > >
> > > If we go to ext4 over ramdisk, it's 4.2% slower. Btrfs is 4.3% slower, XFS
> > > is about 4.9% slower.
> >
> > That is actually significant, because current XFS performance
> > using delayed logging for pure metadata operations is not that far
> > off ramdisk results. Indeed, the simple test:
> >
> > while (i++ < 1000 * 1000) {
> > int fd = open("foo", O_CREAT|O_RDWR, 777);
> > unlink("foo");
> > close(fd);
> > }
> >
> > Running 8 instances of the above on XFS, each in their own
> > directory, on a single sata drive with delayed logging enabled with
> > my current working XFS tree (includes SLAB_DESTROY_BY_RCU inode
> > cache and XFS inode cache, and numerous other XFS scalability
> > enhancements) currently runs at ~250k files/s. It took ~33s for 8 of
> > those loops above to complete in parallel, and was 100% CPU bound...
>
> David,
>
> This is 30K inodes per second per CPU, versus nearly 800K per second
> number that I measured the 12% slowdown with. About 25x slower.

Hi Nick, the ramfs (800k/12%) numbers are not the context I was
responding to - you're comparing apples to oranges. I was responding to
the "XFS [on a ramdisk] is about 4.9% slower" result.

> How you
> are trying to FUD this as doing anything but confirming my hypothesis, I
> don't know and honestly I don't want to know so don't try to tell me.

Hardly FUD. I thought it important to point out that your
filesystem-on-ramdisk numbers are not theoretical at all - we can
achieve the same level of performance on a single SATA drive for
this workload on XFS. Therefore, the 5% difference in performance
you've measured on a ramdisk will definitely be visible in the real
world and we need to consider it in that context, not as a
"theoretical concern".

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-16 03:49:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 16, 2010 at 02:02:43PM +1100, Dave Chinner wrote:
> On Mon, Nov 15, 2010 at 03:21:00PM +1100, Nick Piggin wrote:
> > This is 30K inodes per second per CPU, versus nearly 800K per second
> > number that I measured the 12% slowdown with. About 25x slower.
>
> Hi Nick, the ramfs (800k/12%) numbers are not the context I was
> responding to - you're comparing apples to oranges. I was responding to
> the "XFS [on a ramdisk] is about 4.9% slower" result.

Well xfs on ramdisk was (85k/4.9%). At a lower number, like 30k, I would
expect that to be around 1-2% perhaps. And in the context of a
real workload that is not 100% CPU bound on creating and destroying a
single inode, I expect that to be well under 1%.
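
That extrapolation can be sanity-checked by treating the RCU cost as a
fixed per-inode overhead derived from the 85k/4.9% measurement (a rough
model only, not an exact one):

#include <stdio.h>

int main(void)
{
	double rate = 85000.0;		/* files/s: XFS-on-ramdisk measurement */
	double slowdown = 0.049;	/* 4.9% regression at that rate */

	/* Implied fixed RCU cost per inode: roughly 576ns. */
	double per_inode = slowdown / rate;

	/* The same cost at ~30k files/s/CPU is a smaller fraction. */
	printf("expected regression at 30k files/s: %.1f%%\n",
	       per_inode * 30000.0 * 100.0);
	return 0;	/* prints ~1.7%, in line with "around 1-2%" */
}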

Like I said, I never disputed a potential regression, but I have looked
for workloads that have a detectable regression and have not found any.
And I have extrapolated microbenchmark numbers to show that it's not
going to be a _big_ problem even in a worst case scenario.

Based on that, and the fact that the complexity of rcu-walk goes up quite a
bit with SLAB_DESTROY_BY_RCU, I explained why I want to go with RCU
first. I am much more worried about complexity, review coverage and
maintainability than about a small worst-case regression in some rare
(non-existent) workloads.

I say non-existent because if anybody actually had a workload that tried
to create and destroy inodes that fast, they would have been yelling at the
top of their lungs already, because the vfs was totally unscalable for
them, let alone most filesystems. Ie. we have quite strong empirical
evidence that people are _not_ hitting this terribly hard.

The first real complaint came not long ago, from Google, and that was
due to a sockets workload, not files.

I think that is quite a reasonable position. I would definitely like to
see and hear of any numbers or workloads that you can suggest I try, but
I still want to merge rcu-walk first and then do an incremental
SLAB RCU patch on top of that as my preferred approach.

Thanks,
Nick

2010-11-17 01:13:01

by Dave Chinner

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 16, 2010 at 02:49:06PM +1100, Nick Piggin wrote:
> On Tue, Nov 16, 2010 at 02:02:43PM +1100, Dave Chinner wrote:
> > On Mon, Nov 15, 2010 at 03:21:00PM +1100, Nick Piggin wrote:
> > > This is 30K inodes per second per CPU, versus nearly 800K per second
> > > number that I measured the 12% slowdown with. About 25x slower.
> >
> > Hi Nick, the ramfs (800k/12%) numbers are not the context I was
> > responding to - you're comparing apples to oranges. I was responding to
> > the "XFS [on a ramdisk] is about 4.9% slower" result.
>
> Well xfs on ramdisk was (85k/4.9%).

How many threads? On a 2.26GHz nehalem-class Xeon CPU, I'm seeing:

threads    files/s
      1        45k
      2        70k
      4       130k
      8       230k

With scalability mainly limited by the dcache_lock. I'm not sure
what your 85k number relates to in the above chart. Is it a single
thread number, or something else? If it is a single thread, can you
run your numbers again with a thread per CPU?

> At a lower number, like 30k, I would
> expect that should be around 1-2% perhaps. And when in the context of a
> real workload that is not 100% CPU bound on creating and destroying a
> single inode, I expect that to be well under 1%.

I don't think we are comparing apples to apples. I cannot see how you
can get mainline XFS to sustain 85kfiles/s/cpu across any number of
CPUs, so let's make sure we are comparing the same thing....

> Like I said, I never disputed a potential regression, but I have looked
> for workloads that have a detectable regression and have not found any.
> And I have extrapolated microbenchmark numbers to show that it's not
> going to be a _big_ problem even in a worst case scenario.

How did you extrapolate the numbers?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-17 04:18:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Wed, Nov 17, 2010 at 12:12:54PM +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2010 at 02:49:06PM +1100, Nick Piggin wrote:
> > On Tue, Nov 16, 2010 at 02:02:43PM +1100, Dave Chinner wrote:
> > > On Mon, Nov 15, 2010 at 03:21:00PM +1100, Nick Piggin wrote:
> > > > This is 30K inodes per second per CPU, versus nearly 800K per second
> > > > number that I measured the 12% slowdown with. About 25x slower.
> > >
> > > Hi Nick, the ramfs (800k/12%) numbers are not the context I was
> > > responding to - you're comparing apples to oranges. I was responding to
> > > the "XFS [on a ramdisk] is about 4.9% slower" result.
> >
> > Well xfs on ramdisk was (85k/4.9%).
>
> How many threads? On a 2.26GHz nehalem-class Xeon CPU, I'm seeing:
>
> threads files/s
> 1 45k
> 2 70k
> 4 130k
> 8 230k
>
> With scalability mainly limited by the dcache_lock. I'm not sure
> what your 85k number relates to in the above chart. Is it a single

Yes, a single thread. 86385 inodes created and destroyed per second.
upstream kernel.


> thread number, or something else? If it is a single thread, can you
> run your numbers again with a thread per CPU?

I don't have my inode scalability series in one piece at the moment,
so that would be pointless. Why don't you run RCU numbers?


> > At a lower number, like 30k, I would
> > expect that should be around 1-2% perhaps. And when in the context of a
> > real workload that is not 100% CPU bound on creating and destroying a
> > single inode, I expect that to be well under 1%.
>
> I don't think we are comparing apples to apples. I cannot see how you
> can get mainline XFS to sustain 85kfiles/s/cpu across any number of
> CPUs, so lets make sure we are comparing the same thing....

What do you mean? You are not comparing anything. I am giving you
numbers that I got, comparing RCU and non-RCU inode freeing and holding
everything else constant, and it most certainly is apples to apples.

>
> > Like I said, I never disputed a potential regression, but I have looked
> > for workloads that have a detectable regression and have not found any.
> > And I have extrapolated microbenchmark numbers to show that it's not
> > going to be a _big_ problem even in a worst case scenario.
>
> How did you extrapolate the numbers?

I've covered that several times, including in this thread. So I'll go
out on a limb and assume you've read it. So let me ask you: what do
you disagree with in what I've written? And what workloads have you been
using to measure inode work with? If it's not a setup that I can
replicate here, then perhaps you could run RCU numbers there.

2010-11-17 05:56:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Wed, Nov 17, 2010 at 03:18:12PM +1100, Nick Piggin wrote:
> On Wed, Nov 17, 2010 at 12:12:54PM +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2010 at 02:49:06PM +1100, Nick Piggin wrote:
> > > On Tue, Nov 16, 2010 at 02:02:43PM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 15, 2010 at 03:21:00PM +1100, Nick Piggin wrote:
> > > > > This is 30K inodes per second per CPU, versus nearly 800K per second
> > > > > number that I measured the 12% slowdown with. About 25x slower.
> > > >
> > > > Hi Nick, the ramfs (800k/12%) numbers are not the context I was
> > > > responding to - you're comparing apples to oranges. I was responding to
> > > > the "XFS [on a ramdisk] is about 4.9% slower" result.
> > >
> > > Well xfs on ramdisk was (85k/4.9%).
> >
> > How many threads? On a 2.26GHz nehalem-class Xeon CPU, I'm seeing:
> >
> > threads files/s
> > 1 45k
> > 2 70k
> > 4 130k
> > 8 230k
> >
> > With scalability mainly limited by the dcache_lock. I'm not sure
> > what your 85k number relates to in the above chart. Is it a single
>
> Yes, a single thread. 86385 inodes created and destroyed per second.
> upstream kernel.

92K actually, with delaylog. Still a long way off ext4, which itself is
a very long way off ramfs. Do you have lots of people migrating off xfs
to ext4 because it is so much quicker? I doubt it, because I'm sure xfs
is often as good or better at what people are actually doing.

Yes, it's great if it can avoid hitting the disk and run from cache,
but my point was that real workloads are not going to follow the
busy-loop create/destroy pattern in the slightest. And real IO will
actually get in the way quite often.

So you are going to be a long way off even the 4-5% theoretical worst
case. Every time a creat is followed by something other than an unlink
(eg. another creat, a lock, some IO, some calculation, a write), that
gap is reduced.

So the closest creat/unlink-intensive benchmark I have found is fs_mark
with 0 file size and no syncs. It's basically just inode create and
destroy in something slightly better than a busy loop.

I ran that on ramdisk, on xfs with delaylog. 100 times.

Average files/s:
vanilla - 39648.76
rcu - 39916.66

Ie. RCU actually had a slightly higher mean, but assuming a normal
distribution, there was no significant difference at 95% confidence.

Mind you, this is still 40k files/s -- so it's still on the high side
compared to anything doing _real_ work, real IO, or anything
non-trivial with the damn things.

So there. I restate my case. I have put up the numbers, and I have
shown that even the worst case is not the end of the world. I don't know
why I've had to repeat it so many times, but honestly at this point I've
done enough. The case is closed until any *actual* significant numbers
to the contrary turn up.

I've been much more diligent than most people at examining worst cases
and doing benchmarks, and we really shouldn't hold up kernel development
beyond that without actual numbers as a basis.

Thanks,
Nick

2010-11-17 06:04:13

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Wed, Nov 17, 2010 at 04:56:33PM +1100, Nick Piggin wrote:
> So there. I restate my case. I have put up the numbers, and I have
> shown that even the worst case is not the end of the world. I don't know
> why I've had to repeat it so many times, but honestly at this point I've
> done enough. The case is closed until any *actual* significant numbers
> to the contrary turn up.

Same goes for per-zone LRUs, FWIW.

You can't ask the submitter of every single change to prove a negative.

At some point you have to move on with life and accept that regressions
are inevitable, performance regressions are inevitable, and that we have
a development model that is well geared by now to tolerate and resolve
regressions.