2009-04-11 12:01:54

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 0/9] File descriptor hot-unplug support


A couple of weeks ago I found myself looking at the uio, seeing that
it does not support pci hot-unplug, and thinking "Great yet another
implementation of hotunplug logic that needs to be added".

I decided to see what it would take to add a generic implementation of
the code we have for supporting hot unplugging devices in sysfs, proc,
sysctl, tty_io, and now almost in the tun driver.

Not long after I touched the tun driver and made it safe to delete the
network device while still holding it's file descriptor open I someone
else touch the code adding a different feature and my careful work
went up in flames. Which brought home another point at the best of it
this is ultimately complex tricky code that subsystems should not need
to worry about.

What makes this even more interesting is that in the presence of pci
hot-unplug it looks like most subsystems and most devices will have to
deal with the issue one way or another.

This infrastructure could also be used to implement sys_revoke and
when I could not think of a better name I have drawn on that.

The following changes draw on and generalize the work in tty_io sysfs,
proc, and sysctl and move it into the vfs level. Where the basic
primitives are running faster, and the solution more general.

The work is not complete. I have only fully converted proc. And
there are more places in the vfs that need to be touched. But it is
close enough the code works in practice and all of the core challenges
should have been solved, and the design should be clear.

Documentation/filesystems/vfs.txt | 4 +
drivers/char/pty.c | 2 +-
drivers/char/tty_io.c | 2 +-
fs/Makefile | 2 +-
fs/compat.c | 31 +++-
fs/fcntl.c | 32 +++--
fs/file_table.c | 189 ++++++++++++++++++---
fs/inode.c | 1 +
fs/ioctl.c | 39 +++--
fs/locks.c | 81 +++++++--
fs/open.c | 32 +++-
fs/proc/generic.c | 100 ++++--------
fs/proc/inode.c | 339 +------------------------------------
fs/proc/internal.h | 2 +
fs/proc/root.c | 2 +-
fs/read_write.c | 143 ++++++++++++----
fs/readdir.c | 14 ++-
fs/revoked_file.c | 181 ++++++++++++++++++++
fs/select.c | 17 ++-
fs/super.c | 49 +++---
fs/sysfs/bin.c | 193 +---------------------
include/linux/fs.h | 31 +++-
include/linux/mm.h | 3 +
include/linux/proc_fs.h | 4 -
mm/memory.c | 96 +++++++++++
25 files changed, 841 insertions(+), 748 deletions(-)

Eric


2009-04-11 12:03:40

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 1/9] mm: Introduce remap_file_mappings.


>From 819088d701e3a821db8e08ed83d46c00c27cc772 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <[email protected]>
Date: Tue, 7 Apr 2009 19:35:07 -0700
Subject:

When the backing store of a file becomes inaccessible we need a function
to remove that file from the page tables and arrange for page faults
to receive SIGBUS until the file is unmapped.

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..96d8342 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -807,6 +807,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,

extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
+extern void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops);

#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..dcd0a3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2325,6 +2325,93 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+static void remap_vma(struct vm_area_struct *vma,
+ struct vm_operations_struct *vm_ops)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ unsigned long start_addr, end_addr, size;
+ struct mm_struct *mm;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+
+ /* Has this vma already been processed? */
+ if (vma->vm_ops == vm_ops)
+ return;
+
+ /* Switch out the locks so I can maninuplate this under the mm sem.
+ * Needed so I can call vm_ops->close.
+ */
+ mm = vma->vm_mm;
+ atomic_inc(&mm->mm_users);
+ spin_unlock(&mapping->i_mmap_lock);
+
+ /* Block page faults and other code modifying the mm. */
+ down_write(&mm->mmap_sem);
+
+ /* Lookup a vma for my file address */
+ vma = find_vma(mm, start_addr);
+ if (vma->vm_file != file)
+ goto out;
+ if (vma->vm_ops == vm_ops)
+ goto out;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+ size = end_addr - start_addr;
+
+ /* Unmap the vma */
+ zap_page_range(vma, start_addr, size, NULL);
+
+ /* Close the vma */
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+
+ /* Repurpose the vma */
+ vma->vm_private_data = NULL;
+ vma->vm_ops = vm_ops;
+ if (vm_ops->open)
+ vm_ops->open(vma);
+out:
+ up_write(&mm->mmap_sem);
+ spin_lock(&mapping->i_mmap_lock);
+}
+
+void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops)
+{
+ /* After file->f_ops has been changed update the vmas */
+ struct address_space *mapping = file->f_mapping;
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+
+ spin_lock(&mapping->i_mmap_lock);
+
+restart_tree:
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ if (vma->vm_ops == vm_ops)
+ continue;
+ remap_vma(vma, vm_ops);
+ goto restart_tree;
+ }
+
+restart_list:
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ if (vma->vm_ops == vm_ops)
+ continue;
+ remap_vma(vma, vm_ops);
+ goto restart_list;
+ }
+
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
--
1.6.1.2.350.g88cc

2009-04-11 12:05:49

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 2/9] mm: Implement generic support for revoking a mapping.


Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 9 +++++++++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 96d8342..3fcbb8e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -807,6 +807,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,

extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
+
+extern struct vm_operations_struct revoked_vm_ops;
extern void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops);

#ifdef CONFIG_MMU
diff --git a/mm/memory.c b/mm/memory.c
index dcd0a3c..f68c84e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2378,6 +2378,15 @@ out:
spin_lock(&mapping->i_mmap_lock);
}

+static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+struct vm_operations_struct revoked_vm_ops = {
+ .fault = revoked_fault,
+};
+
void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops)
{
/* After file->f_ops has been changed update the vmas */
--
1.6.1.2.350.g88cc

2009-04-11 12:06:31

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 3/9] sysfs: Use remap_file_mappings.


Instead of wrapping all of the sysfs binary file vm operations
when the backing kobject goes away, we can more easily change
vm_ops on the vma when the backing kobject goes away.

Leading to simpler and more easily maintained code.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/sysfs/bin.c | 193 +-------------------------------------------------------
1 files changed, 2 insertions(+), 191 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 93e0c02..898163c 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -39,8 +39,6 @@ static DEFINE_MUTEX(sysfs_bin_lock);
struct bin_buffer {
struct mutex mutex;
void *buffer;
- int mmapped;
- struct vm_operations_struct *vm_ops;
struct file *file;
struct hlist_node list;
};
@@ -181,175 +179,6 @@ out_free:
return count;
}

-static void bin_vma_open(struct vm_area_struct *vma)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
-
- if (!bb->vm_ops || !bb->vm_ops->open)
- return;
-
- if (!sysfs_get_active_two(attr_sd))
- return;
-
- bb->vm_ops->open(vma);
-
- sysfs_put_active_two(attr_sd);
-}
-
-static void bin_vma_close(struct vm_area_struct *vma)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
-
- if (!bb->vm_ops || !bb->vm_ops->close)
- return;
-
- if (!sysfs_get_active_two(attr_sd))
- return;
-
- bb->vm_ops->close(vma);
-
- sysfs_put_active_two(attr_sd);
-}
-
-static int bin_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- int ret;
-
- if (!bb->vm_ops || !bb->vm_ops->fault)
- return VM_FAULT_SIGBUS;
-
- if (!sysfs_get_active_two(attr_sd))
- return VM_FAULT_SIGBUS;
-
- ret = bb->vm_ops->fault(vma, vmf);
-
- sysfs_put_active_two(attr_sd);
- return ret;
-}
-
-static int bin_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- int ret;
-
- if (!bb->vm_ops)
- return VM_FAULT_SIGBUS;
-
- if (!bb->vm_ops->page_mkwrite)
- return 0;
-
- if (!sysfs_get_active_two(attr_sd))
- return VM_FAULT_SIGBUS;
-
- ret = bb->vm_ops->page_mkwrite(vma, vmf);
-
- sysfs_put_active_two(attr_sd);
- return ret;
-}
-
-static int bin_access(struct vm_area_struct *vma, unsigned long addr,
- void *buf, int len, int write)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- int ret;
-
- if (!bb->vm_ops || !bb->vm_ops->access)
- return -EINVAL;
-
- if (!sysfs_get_active_two(attr_sd))
- return -EINVAL;
-
- ret = bb->vm_ops->access(vma, addr, buf, len, write);
-
- sysfs_put_active_two(attr_sd);
- return ret;
-}
-
-#ifdef CONFIG_NUMA
-static int bin_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- int ret;
-
- if (!bb->vm_ops || !bb->vm_ops->set_policy)
- return 0;
-
- if (!sysfs_get_active_two(attr_sd))
- return -EINVAL;
-
- ret = bb->vm_ops->set_policy(vma, new);
-
- sysfs_put_active_two(attr_sd);
- return ret;
-}
-
-static struct mempolicy *bin_get_policy(struct vm_area_struct *vma,
- unsigned long addr)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- struct mempolicy *pol;
-
- if (!bb->vm_ops || !bb->vm_ops->get_policy)
- return vma->vm_policy;
-
- if (!sysfs_get_active_two(attr_sd))
- return vma->vm_policy;
-
- pol = bb->vm_ops->get_policy(vma, addr);
-
- sysfs_put_active_two(attr_sd);
- return pol;
-}
-
-static int bin_migrate(struct vm_area_struct *vma, const nodemask_t *from,
- const nodemask_t *to, unsigned long flags)
-{
- struct file *file = vma->vm_file;
- struct bin_buffer *bb = file->private_data;
- struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
- int ret;
-
- if (!bb->vm_ops || !bb->vm_ops->migrate)
- return 0;
-
- if (!sysfs_get_active_two(attr_sd))
- return 0;
-
- ret = bb->vm_ops->migrate(vma, from, to, flags);
-
- sysfs_put_active_two(attr_sd);
- return ret;
-}
-#endif
-
-static struct vm_operations_struct bin_vm_ops = {
- .open = bin_vma_open,
- .close = bin_vma_close,
- .fault = bin_fault,
- .page_mkwrite = bin_page_mkwrite,
- .access = bin_access,
-#ifdef CONFIG_NUMA
- .set_policy = bin_set_policy,
- .get_policy = bin_get_policy,
- .migrate = bin_migrate,
-#endif
-};
-
static int mmap(struct file *file, struct vm_area_struct *vma)
{
struct bin_buffer *bb = file->private_data;
@@ -370,25 +199,7 @@ static int mmap(struct file *file, struct vm_area_struct *vma)
goto out_put;

rc = attr->mmap(kobj, attr, vma);
- if (rc)
- goto out_put;
-
- /*
- * PowerPC's pci_mmap of legacy_mem uses shmem_zero_setup()
- * to satisfy versions of X which crash if the mmap fails: that
- * substitutes a new vm_file, and we don't then want bin_vm_ops.
- */
- if (vma->vm_file != file)
- goto out_put;
-
- rc = -EINVAL;
- if (bb->mmapped && bb->vm_ops != vma->vm_ops)
- goto out_put;

- rc = 0;
- bb->mmapped = 1;
- bb->vm_ops = vma->vm_ops;
- vma->vm_ops = &bin_vm_ops;
out_put:
sysfs_put_active_two(attr_sd);
out_unlock:
@@ -475,9 +286,9 @@ void unmap_bin_file(struct sysfs_dirent *attr_sd)
mutex_lock(&sysfs_bin_lock);

hlist_for_each_entry(bb, tmp, &attr_sd->s_bin_attr.buffers, list) {
- struct inode *inode = bb->file->f_path.dentry->d_inode;
+ struct file *file = bb->file;

- unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+ remap_file_mappings(file, &revoked_vm_ops);
}

mutex_unlock(&sysfs_bin_lock);
--
1.6.1.2.350.g88cc

2009-04-11 12:07:58

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 4/9] vfs: Generalize the file_list


file_list_lock is held when files are being revoked
aka hung up on in tty_io, and also needs to be done
in a more general revoke case. The more general revoke
case also needs to sleep so I have converted file_list_lock
into a mutex.

To make it clear what is on the file list and when I have changed
file_move to file_add. The callers have been modified to to ensure
file_add is called exactly once on a file. In __dentry_open
file_add is called on everything except device files. In __ptmx_open
and __tty_open file_add is called to place the ttys on the tty_files
list.

To make using the file_list for efficient for handling revokes
I have moved the list head from the superblock to the inode.
This means only relevant files need to be looked at.

fs_may_remount_ro and mark_files_ro have been modified to
walk the inode list to find all of the inodes and then to
walk the file list on those inodes. It is a slightly slower
process but just as efficient and potentially more correct
as inodes may have some influence on the the rw state of the
filesystem that files do not.

Signed-off-by: Eric W. Biederman <[email protected]>
---
drivers/char/pty.c | 2 +-
drivers/char/tty_io.c | 2 +-
fs/file_table.c | 27 ++++++++++++++++++---------
fs/inode.c | 1 +
fs/open.c | 3 ++-
fs/super.c | 49 +++++++++++++++++++++++++++----------------------
include/linux/fs.h | 10 +++++-----
7 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index 31038a0..3ed304c 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -662,7 +662,7 @@ static int __ptmx_open(struct inode *inode, struct file *filp)

set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_add(filp, &tty->tty_files);

retval = devpts_pty_new(inode, tty->link);
if (retval)
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 66b99a2..22b978e 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -1836,7 +1836,7 @@ got_driver:
return PTR_ERR(tty);

filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_add(filp, &tty->tty_files);
check_tty_count(tty, "tty_open");
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_MASTER)
diff --git a/fs/file_table.c b/fs/file_table.c
index 54018fe..03d74b6 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -22,6 +22,7 @@
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
+#include <linux/writeback.h>

#include <asm/atomic.h>

@@ -31,7 +32,7 @@ struct files_stat_struct files_stat = {
};

/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+__cacheline_aligned_in_smp DEFINE_MUTEX(files_lock);

/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;
@@ -357,12 +358,12 @@ void put_filp(struct file *file)
}
}

-void file_move(struct file *file, struct list_head *list)
+void file_add(struct file *file, struct list_head *list)
{
if (!list)
return;
file_list_lock();
- list_move(&file->f_u.fu_list, list);
+ list_add(&file->f_u.fu_list, list);
file_list_unlock();
}

@@ -377,24 +378,32 @@ void file_kill(struct file *file)

int fs_may_remount_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *file;

/* Check that no files are currently opened for writing. */
file_list_lock();
- list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
- struct inode *inode = file->f_path.dentry->d_inode;
-
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/* File with pending delete? */
if (inode->i_nlink == 0)
goto too_bad;

- /* Writeable file? */
- if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
- goto too_bad;
+ /* Regular file */
+ if (!S_ISREG(inode->i_mode))
+ continue;
+
+ list_for_each_entry(file, &inode->i_files, f_u.fu_list) {
+ /* Writeable file? */
+ if (file->f_mode & FMODE_WRITE)
+ goto too_bad;
+ }
}
+ spin_unlock(&inode_lock);
file_list_unlock();
return 1; /* Tis' cool bro. */
too_bad:
+ spin_unlock(&inode_lock);
file_list_unlock();
return 0;
}
diff --git a/fs/inode.c b/fs/inode.c
index d06d6d2..9682caf 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -238,6 +238,7 @@ void inode_init_once(struct inode *inode)
memset(inode, 0, sizeof(*inode));
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
+ INIT_LIST_HEAD(&inode->i_files);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
diff --git a/fs/open.c b/fs/open.c
index 377eb25..5e201cb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -828,7 +828,8 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
f->f_path.mnt = mnt;
f->f_pos = 0;
f->f_op = fops_get(inode->i_fop);
- file_move(f, &inode->i_sb->s_files);
+ if (!special_file(inode->i_mode))
+ file_add(f, &inode->i_files);

error = security_dentry_open(f, cred);
if (error)
diff --git a/fs/super.c b/fs/super.c
index 786fe7d..e55299c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -67,7 +67,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_more_io);
- INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
@@ -597,32 +596,38 @@ out:

static void mark_files_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *f;

retry:
file_list_lock();
- list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
- struct vfsmount *mnt;
- if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
- continue;
- if (!file_count(f))
- continue;
- if (!(f->f_mode & FMODE_WRITE))
- continue;
- f->f_mode &= ~FMODE_WRITE;
- if (file_check_writeable(f) != 0)
- continue;
- file_release_write(f);
- mnt = mntget(f->f_path.mnt);
- file_list_unlock();
- /*
- * This can sleep, so we can't hold
- * the file_list_lock() spinlock.
- */
- mnt_drop_write(mnt);
- mntput(mnt);
- goto retry;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ list_for_each_entry(f, &inode->i_files, f_u.fu_list) {
+ struct vfsmount *mnt;
+ if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+ continue;
+ if (!file_count(f))
+ continue;
+ if (!(f->f_mode & FMODE_WRITE))
+ continue;
+ f->f_mode &= ~FMODE_WRITE;
+ if (file_check_writeable(f) != 0)
+ continue;
+ file_release_write(f);
+ mnt = mntget(f->f_path.mnt);
+ spin_unlock(&inode_lock);
+ file_list_unlock();
+ /*
+ * This can sleep, so we can't hold
+ * the inode_lock spinlock.
+ */
+ mnt_drop_write(mnt);
+ mntput(mnt);
+ goto retry;
+ }
}
+ spin_unlock(&inode_lock);
file_list_unlock();
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 562d285..7805d20 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -656,6 +656,7 @@ struct inode {
struct list_head i_list;
struct list_head i_sb_list;
struct list_head i_dentry;
+ struct list_head i_files;
unsigned long i_ino;
atomic_t i_count;
unsigned int i_nlink;
@@ -878,9 +879,9 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
+extern struct mutex files_lock;
+#define file_list_lock() mutex_lock(&files_lock);
+#define file_list_unlock() mutex_unlock(&files_lock);

#define get_file(x) atomic_long_inc(&(x)->f_count)
#define file_count(x) atomic_long_read(&(x)->f_count)
@@ -1277,7 +1278,6 @@ struct super_block {
struct list_head s_io; /* parked for writeback */
struct list_head s_more_io; /* parked for more writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
- struct list_head s_files;
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */
@@ -2116,7 +2116,7 @@ static inline void insert_inode_hash(struct inode *inode) {
}

extern struct file * get_empty_filp(void);
-extern void file_move(struct file *f, struct list_head *list);
+extern void file_add(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
#ifdef CONFIG_BLOCK
struct bio;
--
1.6.1.2.350.g88cc

2009-04-11 12:09:27

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file


Going forward fops_read_lock should be held whenever file->f_op is being
accesed and when the functions from f_op are executing.

In 4 subsystems sysfs, proc, and sysctl, and tty we have support
for modifing a file descriptor so that the underlying object
can go away. In looking at the problem of pci hotunplug
it appears that we potentially need that support for all file
descriptors except ones talking to files on filesystems. Even
on for file descriptors referring to files support for file
the underlying object going away is interesting for implementing
features like umount -f and sys_revoke.

The implementations in sysfs, proc and sysctl are all very similar
and are composed of several components.
- A reference count to track that the file operations are being used.
- An ability to flag the file as no longer being valid.
- An ability to wait until the reference count is no longer used.

Tracking when file_operations functions are running is done by holding
the fops_read_lock across their invocations.

Flagging when the file is no longer valid will be done by taking f_lock
and modifying f_op, with a set of file operations that will return
appropriate error codes,. roughly EIO from most operations, POLLERR
from poll, and 0 from reads, and setting FMODE_REVOKE.

Waiting until the functions are no longer being called is done with
by waiting until f_use goes to 0. Essentially the same as synchronize_srcu.

When implementing this I encountered an additional challenge. Ensuring
that f_op->release is called exactly once, in an appropriate context.

To ensure this I have taken several steps.
- file_kill is moved immediate after after frelease in __fput to ensure
the proper context is present even if fop_substitute calls release.

- open sets FMODE_RELEASE after the open succeeds (but before fops_read_unlock)
ensuring that fops_subsittute will know if release needs to be called
after it has finished waiting for all of the files.

- __fput samples fmode and f_op under f_lock and only calls __frelease
if FMODE_REVOKE has not happened and FMODE_RELEASE is pending. Leaving
it up to fops_subsitutate to call __frelease.

- fops_substituate calls __frelease in all cases if after waiting for
all users of a file to go to zero FMODE_RELEASE is still set.

Signed-off-by: Eric W. Biederman <[email protected]>
---
Documentation/filesystems/vfs.txt | 4 +
fs/file_table.c | 154 ++++++++++++++++++++++++++++++++++---
fs/open.c | 19 ++++-
include/linux/fs.h | 19 +++++
4 files changed, 181 insertions(+), 15 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index deeeed0..2b115ba 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -807,6 +807,10 @@ otherwise noted.
splice_read: called by the VFS to splice data from file to a pipe. This
method is used by the splice(2) system call

+ awaken_all_waiters: Called in while revoking a file to wake up poll,
+ aio operations, fasync, and anything else blocked
+ indefinitely waiting for something to happen.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
diff --git a/fs/file_table.c b/fs/file_table.c
index 03d74b6..d216557 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,6 +23,7 @@
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
#include <linux/writeback.h>
+#include <linux/mm.h>

#include <asm/atomic.h>

@@ -204,7 +205,7 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
file->f_path.dentry = dentry;
file->f_path.mnt = mntget(mnt);
file->f_mapping = dentry->d_inode->i_mapping;
- file->f_mode = mode;
+ file->f_mode = mode | FMODE_RELEASE;
file->f_op = fop;

/*
@@ -255,6 +256,51 @@ void drop_file_write_access(struct file *file)
}
EXPORT_SYMBOL_GPL(drop_file_write_access);

+static void __frelease(struct file *file, struct inode *inode,
+ const struct file_operations *f_op)
+{
+ locks_remove_flock(file);
+ if (unlikely(file->f_flags & FASYNC)) {
+ if (f_op && f_op->fasync)
+ file->f_op->fasync(-1, file, 0);
+ }
+ if (f_op && f_op->release)
+ f_op->release(inode, file);
+
+ if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
+ cdev_put(inode->i_cdev);
+}
+
+static void frelease(struct file *file, struct inode *inode)
+
+{
+ const struct file_operations *f_op;
+ int fops_idx;
+ fmode_t mode;
+ int need_release = 0;
+
+ fops_idx = fops_read_lock(file);
+
+ /*
+ * Ensure that __frelease is called exactly once.
+ *
+ * We don't do anything if FMODE_REVOKED is set because
+ * we will have a f_op without the proper release method
+ * and so can not cleanup from this path.
+ */
+ spin_lock(&file->f_lock);
+ f_op = file->f_op;
+ mode = file->f_mode;
+ need_release = (mode & (FMODE_REVOKED | FMODE_RELEASE)) == FMODE_RELEASE;
+ if (need_release)
+ file->f_mode = mode & ~FMODE_RELEASE;
+ spin_unlock(&file->f_lock);
+
+ if (need_release)
+ __frelease(file, inode, f_op);
+ fops_read_unlock(file, fops_idx);
+}
+
/* __fput is called from task context when aio completion releases the last
* last use of a struct file *. Do not use otherwise.
*/
@@ -272,21 +318,14 @@ void __fput(struct file *file)
* in the file cleanup chain.
*/
eventpoll_release(file);
- locks_remove_flock(file);

- if (unlikely(file->f_flags & FASYNC)) {
- if (file->f_op && file->f_op->fasync)
- file->f_op->fasync(-1, file, 0);
- }
- if (file->f_op && file->f_op->release)
- file->f_op->release(inode, file);
+ frelease(file, inode);
+ file_kill(file);
+
security_file_free(file);
ima_file_free(file);
- if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
- cdev_put(inode->i_cdev);
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- file_kill(file);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
@@ -296,6 +335,78 @@ void __fput(struct file *file)
mntput(mnt);
}

+int fops_substitute(struct file *file, const struct file_operations *f_op,
+ struct vm_operations_struct *vm_ops)
+{
+ /* Must be called with file_list_lock held */
+ /* This currently assumes that the new f_op does not need
+ * open or release to be called.
+ * This currently assumes that it will not be called twice
+ * on the same file.
+ */
+ const struct file_operations *old_f_op;
+ fmode_t mode;
+ int err;
+
+ err = -EINVAL;
+ f_op = fops_get(f_op);
+ if (!f_op)
+ goto out;
+ /*
+ * Ensure we have no new users of the old f_ops.
+ * Assignment order is important here.
+ */
+ spin_lock(&file->f_lock);
+ old_f_op = file->f_op;
+ rcu_assign_pointer(file->f_op, f_op);
+ file->f_mode |= FMODE_REVOKED;
+ spin_unlock(&file->f_lock);
+
+ /*
+ * Drain the existing uses of the original f_ops.
+ */
+ remap_file_mappings(file, vm_ops);
+ if (old_f_op->awaken_all_waiters)
+ old_f_op->awaken_all_waiters(file);
+
+ /*
+ * Wait until there are no more callers in the original
+ * file_operations methods.
+ */
+ while (atomic_long_read(&file->f_use) > 0)
+ schedule_timeout_interruptible(1);
+
+ /*
+ * Cleanup the data structures that were associated
+ * with the old fops.
+ */
+ spin_lock(&file->f_lock);
+ mode = file->f_mode;
+ file->f_mode = mode & ~FMODE_RELEASE;
+ spin_unlock(&file->f_lock);
+ if (mode & FMODE_RELEASE)
+ __frelease(file, file->f_path.dentry->d_inode, old_f_op);
+ fops_put(old_f_op);
+ file->private_data = NULL;
+ err = 0;
+out:
+ return err;
+}
+
+void inode_fops_substitute(struct inode *inode,
+ const struct file_operations *f_op,
+ struct vm_operations_struct *vm_ops)
+{
+ struct file *file;
+
+ file_list_lock();
+ /* Prevent new files from showing up with the old f_ops */
+ inode->i_fop = f_op;
+ list_for_each_entry(file, &inode->i_files, f_u.fu_list)
+ fops_substitute(file, f_op, vm_ops);
+ file_list_unlock();
+}
+
struct file *fget(unsigned int fd)
{
struct file *file;
@@ -358,12 +469,17 @@ void put_filp(struct file *file)
}
}

+void __file_add(struct file *file, struct list_head *list)
+{
+ list_add(&file->f_u.fu_list, list);
+}
+
void file_add(struct file *file, struct list_head *list)
{
if (!list)
return;
file_list_lock();
- list_add(&file->f_u.fu_list, list);
+ __file_add(file, list);
file_list_unlock();
}

@@ -376,6 +492,20 @@ void file_kill(struct file *file)
}
}

+int fops_read_lock(struct file *file)
+{
+ int revoked = (file->f_mode & FMODE_REVOKED);
+ if (likely(!revoked))
+ atomic_long_inc(&file->f_use);
+ return revoked;
+}
+
+void fops_read_unlock(struct file *file, int revoked)
+{
+ if (likely(!revoked))
+ atomic_long_dec(&file->f_use);
+}
+
int fs_may_remount_ro(struct super_block *sb)
{
struct inode *inode;
diff --git a/fs/open.c b/fs/open.c
index 5e201cb..0b75dde 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -808,7 +808,9 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
int (*open)(struct inode *, struct file *),
const struct cred *cred)
{
+ const struct file_operations *f_op;
struct inode *inode;
+ int fops_idx;
int error;

f->f_flags = flags;
@@ -827,21 +829,31 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
f->f_path.dentry = dentry;
f->f_path.mnt = mnt;
f->f_pos = 0;
+
+ file_list_lock();
f->f_op = fops_get(inode->i_fop);
if (!special_file(inode->i_mode))
- file_add(f, &inode->i_files);
+ __file_add(f, &inode->i_files);
+ file_list_unlock();
+
+ fops_idx = fops_read_lock(f);
+ f_op = rcu_dereference(f->f_op);

error = security_dentry_open(f, cred);
if (error)
goto cleanup_all;

- if (!open && f->f_op)
- open = f->f_op->open;
+ if (!open && f_op)
+ open = f_op->open;
if (open) {
error = open(inode, f);
if (error)
goto cleanup_all;
}
+ spin_lock(&f->f_lock);
+ f->f_mode |= FMODE_RELEASE;
+ spin_unlock(&f->f_lock);
+ fops_read_unlock(f, fops_idx);

f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

@@ -860,6 +872,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
return f;

cleanup_all:
+ fops_read_unlock(f, fops_idx);
fops_put(f->f_op);
if (f->f_mode & FMODE_WRITE) {
put_write_access(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7805d20..a82a2ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -78,6 +78,13 @@ struct inodes_stat_t {
/* File is opened using open(.., 3, ..) and is writeable only for ioctls
(specialy hack for floppy.c) */
#define FMODE_WRITE_IOCTL ((__force fmode_t)256)
+/* File release method needs to be called */
+#define FMODE_RELEASE ((__force fmode_t)512)
+/*
+ * The file descriptor has been denied access to the original object.
+ * Likely a module removal, or device has been unplugged.
+ */
+#define FMODE_REVOKED ((__force fmode_t)1024)

/*
* Don't update ctime and mtime.
@@ -329,6 +336,7 @@ struct kstatfs;
struct vm_area_struct;
struct vfsmount;
struct cred;
+struct vm_operations_struct;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -856,6 +864,7 @@ struct file {
const struct file_operations *f_op;
spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */
atomic_long_t f_count;
+ atomic_long_t f_use; /* f_op, private_data */
unsigned int f_flags;
fmode_t f_mode;
loff_t f_pos;
@@ -879,6 +888,14 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
+
+extern int fops_read_lock(struct file *file);
+extern void fops_read_unlock(struct file *file, int idx);
+extern int fops_substitute(struct file *file, const struct file_operations *f_op,
+ struct vm_operations_struct *vm_ops);
+extern void inode_fops_substitute(struct inode *inode,
+ const struct file_operations *f_op, struct vm_operations_struct *vm_ops);
+
extern struct mutex files_lock;
#define file_list_lock() mutex_lock(&files_lock);
#define file_list_unlock() mutex_unlock(&files_lock);
@@ -1452,6 +1469,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ int (*awaken_all_waiters)(struct file *);
};

struct inode_operations {
@@ -2116,6 +2134,7 @@ static inline void insert_inode_hash(struct inode *inode) {
}

extern struct file * get_empty_filp(void);
+extern void __file_add(struct file *f, struct list_head *list);
extern void file_add(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
#ifdef CONFIG_BLOCK
--
1.6.1.2.350.g88cc

2009-04-11 12:11:21

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 6/9] vfs: Utilize fops_read_lock where appropriate


Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/compat.c | 31 ++++++++----
fs/fcntl.c | 32 ++++++++----
fs/ioctl.c | 39 +++++++++------
fs/locks.c | 81 +++++++++++++++++++++++++------
fs/open.c | 12 ++++-
fs/read_write.c | 143 ++++++++++++++++++++++++++++++++++++++++++-------------
fs/readdir.c | 14 ++++-
fs/select.c | 17 +++++-
8 files changed, 276 insertions(+), 93 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 3f84d5f..a73ca0d 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1090,6 +1090,7 @@ out:
#endif /* ! __ARCH_OMIT_COMPAT_SYS_GETDENTS64 */

static ssize_t compat_do_readv_writev(int type, struct file *file,
+ const struct file_operations *f_op,
const struct compat_iovec __user *uvector,
unsigned long nr_segs, loff_t *pos)
{
@@ -1117,7 +1118,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
ret = -EINVAL;
if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0))
goto out;
- if (!file->f_op)
+ if (!f_op)
goto out;
if (nr_segs > UIO_FASTIOV) {
ret = -ENOMEM;
@@ -1170,11 +1171,11 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,

fnv = NULL;
if (type == READ) {
- fn = file->f_op->read;
- fnv = file->f_op->aio_read;
+ fn = f_op->read;
+ fnv = f_op->aio_read;
} else {
- fn = (io_fn_t)file->f_op->write;
- fnv = file->f_op->aio_write;
+ fn = (io_fn_t)f_op->write;
+ fnv = f_op->aio_write;
}

if (fnv)
@@ -1200,21 +1201,27 @@ static size_t compat_readv(struct file *file,
const struct compat_iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
ssize_t ret = -EBADF;

+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
if (!(file->f_mode & FMODE_READ))
goto out;

ret = -EINVAL;
- if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
+ if (!f_op || (!f_op->aio_read && !f_op->read))
goto out;

- ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(READ, file, f_op, vec, vlen, pos);

out:
if (ret > 0)
add_rchar(current, ret);
inc_syscr(current);
+ fops_read_unlock(file, fops_idx);
return ret;
}

@@ -1257,21 +1264,27 @@ static size_t compat_writev(struct file *file,
const struct compat_iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
ssize_t ret = -EBADF;

+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
if (!(file->f_mode & FMODE_WRITE))
goto out;

ret = -EINVAL;
- if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
+ if (!f_op || (!f_op->aio_write && !f_op->write))
goto out;

- ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(WRITE, file, f_op, vec, vlen, pos);

out:
if (ret > 0)
add_wchar(current, ret);
inc_syscw(current);
+ fops_read_unlock(file, fops_idx);
return ret;
}

diff --git a/fs/fcntl.c b/fs/fcntl.c
index cc8e4de..2718aea 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -146,42 +146,51 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes)
static int setfl(int fd, struct file * filp, unsigned long arg)
{
struct inode * inode = filp->f_path.dentry->d_inode;
- int error = 0;
+ const struct file_operations *f_op;
+ int fops_idx;
+ int error;
+
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);

/*
* O_APPEND cannot be cleared if the file is marked as append-only
* and the file is open for write.
*/
+ error = -EPERM;
if (((arg ^ filp->f_flags) & O_APPEND) && IS_APPEND(inode))
- return -EPERM;
+ goto out;

/* O_NOATIME can only be set by the owner or superuser */
+ error = -EPERM;
if ((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME))
if (!is_owner_or_cap(inode))
- return -EPERM;
+ goto out;

/* required for strict SunOS emulation */
if (O_NONBLOCK != O_NDELAY)
if (arg & O_NDELAY)
arg |= O_NONBLOCK;

+ error = -EINVAL;
if (arg & O_DIRECT) {
if (!filp->f_mapping || !filp->f_mapping->a_ops ||
- !filp->f_mapping->a_ops->direct_IO)
- return -EINVAL;
+ !filp->f_mapping->a_ops->direct_IO)
+ goto out;
}

- if (filp->f_op && filp->f_op->check_flags)
- error = filp->f_op->check_flags(arg);
+ error = 0;
+ if (f_op && f_op->check_flags)
+ error = f_op->check_flags(arg);
if (error)
- return error;
+ goto out;

/*
* ->fasync() is responsible for setting the FASYNC bit.
*/
- if (((arg ^ filp->f_flags) & FASYNC) && filp->f_op &&
- filp->f_op->fasync) {
- error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 0);
+ if (((arg ^ filp->f_flags) & FASYNC) && f_op &&
+ f_op->fasync) {
+ error = f_op->fasync(fd, filp, (arg & FASYNC) != 0);
if (error < 0)
goto out;
if (error > 0)
@@ -192,6 +201,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
spin_unlock(&filp->f_lock);

out:
+ fops_read_unlock(filp, fops_idx);
return error;
}

diff --git a/fs/ioctl.c b/fs/ioctl.c
index ac2d47e..158030b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -33,23 +33,23 @@
*
* Returns 0 on success, -errno on error.
*/
-static long vfs_ioctl(struct file *filp, unsigned int cmd,
- unsigned long arg)
+static long vfs_ioctl(struct file *filp, const struct file_operations *f_op,
+ unsigned int cmd, unsigned long arg)
{
int error = -ENOTTY;

- if (!filp->f_op)
+ if (!f_op)
goto out;

- if (filp->f_op->unlocked_ioctl) {
- error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
+ if (f_op->unlocked_ioctl) {
+ error = f_op->unlocked_ioctl(filp, cmd, arg);
if (error == -ENOIOCTLCMD)
error = -EINVAL;
goto out;
- } else if (filp->f_op->ioctl) {
+ } else if (f_op->ioctl) {
lock_kernel();
- error = filp->f_op->ioctl(filp->f_path.dentry->d_inode,
- filp, cmd, arg);
+ error = f_op->ioctl(filp->f_path.dentry->d_inode, filp,
+ cmd, arg);
unlock_kernel();
}

@@ -370,8 +370,8 @@ EXPORT_SYMBOL(generic_block_fiemap);

#endif /* CONFIG_BLOCK */

-static int file_ioctl(struct file *filp, unsigned int cmd,
- unsigned long arg)
+static int file_ioctl(struct file *filp, const struct file_operations *f_op,
+ unsigned int cmd, unsigned long arg)
{
struct inode *inode = filp->f_path.dentry->d_inode;
int __user *p = (int __user *)arg;
@@ -387,7 +387,7 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
return put_user(i_size_read(inode) - filp->f_pos, p);
}

- return vfs_ioctl(filp, cmd, arg);
+ return vfs_ioctl(filp, f_op, cmd, arg);
}

static int ioctl_fionbio(struct file *filp, int __user *argp)
@@ -414,6 +414,7 @@ static int ioctl_fionbio(struct file *filp, int __user *argp)
}

static int ioctl_fioasync(unsigned int fd, struct file *filp,
+ const struct file_operations *f_op,
int __user *argp)
{
unsigned int flag;
@@ -426,9 +427,9 @@ static int ioctl_fioasync(unsigned int fd, struct file *filp,

/* Did FASYNC state change ? */
if ((flag ^ filp->f_flags) & FASYNC) {
- if (filp->f_op && filp->f_op->fasync)
+ if (f_op && f_op->fasync)
/* fasync() adjusts filp->f_flags */
- error = filp->f_op->fasync(fd, filp, on);
+ error = f_op->fasync(fd, filp, on);
else
error = -ENOTTY;
}
@@ -482,9 +483,14 @@ static int ioctl_fsthaw(struct file *filp)
int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg)
{
+ const struct file_operations *f_op;
+ int fops_idx;
int error = 0;
int __user *argp = (int __user *)arg;

+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
switch (cmd) {
case FIOCLEX:
set_close_on_exec(fd, 1);
@@ -499,7 +505,7 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
break;

case FIOASYNC:
- error = ioctl_fioasync(fd, filp, argp);
+ error = ioctl_fioasync(fd, filp, f_op, argp);
break;

case FIOQSIZE:
@@ -524,11 +530,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,

default:
if (S_ISREG(filp->f_path.dentry->d_inode->i_mode))
- error = file_ioctl(filp, cmd, arg);
+ error = file_ioctl(filp, f_op, cmd, arg);
else
- error = vfs_ioctl(filp, cmd, arg);
+ error = vfs_ioctl(filp, f_op, cmd, arg);
break;
}
+ fops_read_unlock(filp, fops_idx);
return error;
}

diff --git a/fs/locks.c b/fs/locks.c
index ec3deea..5ff959e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1463,15 +1463,21 @@ EXPORT_SYMBOL(generic_setlease);

int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
+ const struct file_operations *f_op;
+ int fops_idx;
int error;

+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
lock_kernel();
- if (filp->f_op && filp->f_op->setlease)
- error = filp->f_op->setlease(filp, arg, lease);
+ if (f_op && f_op->setlease)
+ error = f_op->setlease(filp, arg, lease);
else
error = generic_setlease(filp, arg, lease);
unlock_kernel();

+ fops_read_unlock(filp, fops_idx);
return error;
}
EXPORT_SYMBOL_GPL(vfs_setlease);
@@ -1566,9 +1572,11 @@ EXPORT_SYMBOL(flock_lock_file_wait);
*/
SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
{
+ const struct file_operations *f_op;
struct file *filp;
struct file_lock *lock;
int can_sleep, unlock;
+ int fops_idx;
int error;

error = -EBADF;
@@ -1594,13 +1602,18 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
if (error)
goto out_free;

- if (filp->f_op && filp->f_op->flock)
- error = filp->f_op->flock(filp,
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->flock)
+ error = f_op->flock(filp,
(can_sleep) ? F_SETLKW : F_SETLK,
lock);
else
error = flock_lock_file_wait(filp, lock);

+ fops_read_unlock(filp, fops_idx);
+
out_free:
locks_free_lock(lock);

@@ -1620,10 +1633,20 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
*/
int vfs_test_lock(struct file *filp, struct file_lock *fl)
{
- if (filp->f_op && filp->f_op->lock)
- return filp->f_op->lock(filp, F_GETLK, fl);
- posix_test_lock(filp, fl);
- return 0;
+ const struct file_operations *f_op;
+ int fops_idx;
+ int ret = 0;
+
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->lock)
+ ret = filp->f_op->lock(filp, F_GETLK, fl);
+ else
+ posix_test_lock(filp, fl);
+
+ fops_read_unlock(filp, fops_idx);
+ return ret;
}
EXPORT_SYMBOL_GPL(vfs_test_lock);

@@ -1732,10 +1755,20 @@ out:
*/
int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, struct file_lock *conf)
{
- if (filp->f_op && filp->f_op->lock)
- return filp->f_op->lock(filp, cmd, fl);
+ const struct file_operations *f_op;
+ int fops_idx;
+ int ret;
+
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->lock)
+ ret = f_op->lock(filp, cmd, fl);
else
- return posix_lock_file(filp, fl, conf);
+ ret = posix_lock_file(filp, fl, conf);
+
+ fops_read_unlock(filp, fops_idx);
+ return ret;
}
EXPORT_SYMBOL_GPL(vfs_lock_file);

@@ -1999,13 +2032,18 @@ EXPORT_SYMBOL(locks_remove_posix);
void locks_remove_flock(struct file *filp)
{
struct inode * inode = filp->f_path.dentry->d_inode;
+ const struct file_operations *f_op;
struct file_lock *fl;
struct file_lock **before;
+ int fops_idx;

if (!inode->i_flock)
return;

- if (filp->f_op && filp->f_op->flock) {
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->flock) {
struct file_lock fl = {
.fl_pid = current->tgid,
.fl_file = filp,
@@ -2013,11 +2051,13 @@ void locks_remove_flock(struct file *filp)
.fl_type = F_UNLCK,
.fl_end = OFFSET_MAX,
};
- filp->f_op->flock(filp, F_SETLKW, &fl);
+ f_op->flock(filp, F_SETLKW, &fl);
if (fl.fl_ops && fl.fl_ops->fl_release_private)
fl.fl_ops->fl_release_private(&fl);
}

+ fops_read_unlock(filp, fops_idx);
+
lock_kernel();
before = &inode->i_flock;

@@ -2071,9 +2111,18 @@ EXPORT_SYMBOL(posix_unblock_lock);
*/
int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
{
- if (filp->f_op && filp->f_op->lock)
- return filp->f_op->lock(filp, F_CANCELLK, fl);
- return 0;
+ const struct file_operations *f_op;
+ int fops_idx;
+ int ret = 0;
+
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->lock)
+ ret = f_op->lock(filp, F_CANCELLK, fl);
+
+ fops_read_unlock(filp, fops_idx);
+ return ret;
}

EXPORT_SYMBOL_GPL(vfs_cancel_lock);
diff --git a/fs/open.c b/fs/open.c
index 0b75dde..67031e7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -398,6 +398,7 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
goto out;
if (!(file->f_mode & FMODE_WRITE))
goto out_fput;
+
/*
* Revalidate the write permissions, in case security policy has
* changed since the files were opened.
@@ -1107,6 +1108,8 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, int, mode)
*/
int filp_close(struct file *filp, fl_owner_t id)
{
+ const struct file_operations *f_op;
+ int fops_idx;
int retval = 0;

if (!file_count(filp)) {
@@ -1114,8 +1117,13 @@ int filp_close(struct file *filp, fl_owner_t id)
return 0;
}

- if (filp->f_op && filp->f_op->flush)
- retval = filp->f_op->flush(filp, id);
+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
+ if (f_op && f_op->flush)
+ retval = f_op->flush(filp, id);
+
+ fops_read_unlock(filp, fops_idx);

dnotify_flush(filp, id);
locks_remove_posix(filp, id);
diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..4def2ee 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -136,14 +136,23 @@ EXPORT_SYMBOL(default_llseek);
loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
{
loff_t (*fn)(struct file *, loff_t, int);
+ const struct file_operations *f_op;
+ int fops_idx;
+ loff_t ret;
+
+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);

fn = no_llseek;
if (file->f_mode & FMODE_LSEEK) {
fn = default_llseek;
- if (file->f_op && file->f_op->llseek)
- fn = file->f_op->llseek;
+ if (f_op && f_op->llseek)
+ fn = f_op->llseek;
}
- return fn(file, offset, origin);
+ ret = fn(file, offset, origin);
+
+ fops_read_unlock(file, fops_idx);
+ return ret;
}
EXPORT_SYMBOL(vfs_llseek);

@@ -252,15 +261,20 @@ static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = buf, .iov_len = len };
+ const struct file_operations *f_op;
struct kiocb kiocb;
+ int fops_idx;
ssize_t ret;

+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
kiocb.ki_left = len;

for (;;) {
- ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+ ret = f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
@@ -269,6 +283,8 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
+ fops_read_unlock(filp, fops_idx);
+
return ret;
}

@@ -276,20 +292,28 @@ EXPORT_SYMBOL(do_sync_read);

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
ssize_t ret;

+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
- if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
- return -EINVAL;
+ goto out;
+ ret = -EINVAL;
+ if (!f_op || (!f_op->read && !f_op->aio_read))
+ goto out;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
- return -EFAULT;
+ goto out;

ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
count = ret;
- if (file->f_op->read)
- ret = file->f_op->read(file, buf, count, pos);
+ if (f_op->read)
+ ret = f_op->read(file, buf, count, pos);
else
ret = do_sync_read(file, buf, count, pos);
if (ret > 0) {
@@ -298,7 +322,8 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
}
inc_syscr(current);
}
-
+out:
+ fops_read_unlock(file, fops_idx);
return ret;
}

@@ -307,15 +332,20 @@ EXPORT_SYMBOL(vfs_read);
ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
+ const struct file_operations *f_op;
struct kiocb kiocb;
+ int fops_idx;
ssize_t ret;

+ fops_idx = fops_read_lock(filp);
+ f_op = rcu_dereference(filp->f_op);
+
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
kiocb.ki_left = len;

for (;;) {
- ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
+ ret = f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
@@ -324,6 +354,8 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
+
+ fops_read_unlock(filp, fops_idx);
return ret;
}

@@ -331,20 +363,28 @@ EXPORT_SYMBOL(do_sync_write);

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
ssize_t ret;

+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
- if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
- return -EINVAL;
+ goto out;
+ ret = -EINVAL;
+ if (!f_op || (!f_op->write && !f_op->aio_write))
+ goto out;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
- return -EFAULT;
+ goto out;

ret = rw_verify_area(WRITE, file, pos, count);
if (ret >= 0) {
count = ret;
- if (file->f_op->write)
- ret = file->f_op->write(file, buf, count, pos);
+ if (f_op->write)
+ ret = f_op->write(file, buf, count, pos);
else
ret = do_sync_write(file, buf, count, pos);
if (ret > 0) {
@@ -354,6 +394,8 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
inc_syscw(current);
}

+out:
+ fops_read_unlock(file, fops_idx);
return ret;
}

@@ -611,6 +653,7 @@ out:
}

static ssize_t do_readv_writev(int type, struct file *file,
+ const struct file_operations *f_op,
const struct iovec __user * uvector,
unsigned long nr_segs, loff_t *pos)
{
@@ -621,7 +664,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
io_fn_t fn;
iov_fn_t fnv;

- if (!file->f_op) {
+ if (!f_op) {
ret = -EINVAL;
goto out;
}
@@ -638,11 +681,11 @@ static ssize_t do_readv_writev(int type, struct file *file,

fnv = NULL;
if (type == READ) {
- fn = file->f_op->read;
- fnv = file->f_op->aio_read;
+ fn = f_op->read;
+ fnv = f_op->aio_read;
} else {
- fn = (io_fn_t)file->f_op->write;
- fnv = file->f_op->aio_write;
+ fn = (io_fn_t)f_op->write;
+ fnv = f_op->aio_write;
}

if (fnv)
@@ -666,12 +709,25 @@ out:
ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
+ ssize_t ret;
+
+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
- if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
- return -EINVAL;
+ goto out;
+ ret = -EINVAL;
+ if (!f_op || (!f_op->aio_read && !f_op->read))
+ goto out;
+
+ ret = do_readv_writev(READ, file, f_op, vec, vlen, pos);

- return do_readv_writev(READ, file, vec, vlen, pos);
+out:
+ fops_read_unlock(file, fops_idx);
+ return ret;
}

EXPORT_SYMBOL(vfs_readv);
@@ -679,12 +735,25 @@ EXPORT_SYMBOL(vfs_readv);
ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ const struct file_operations *f_op;
+ int fops_idx;
+ ssize_t ret;
+
+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
- if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
- return -EINVAL;
+ goto out;
+ ret = -EINVAL;
+ if (!f_op || (!f_op->aio_write && !f_op->write))
+ goto out;
+
+ ret = do_readv_writev(WRITE, file, f_op, vec, vlen, pos);

- return do_readv_writev(WRITE, file, vec, vlen, pos);
+out:
+ fops_read_unlock(file, fops_idx);
+ return ret;
}

EXPORT_SYMBOL(vfs_writev);
@@ -790,8 +859,10 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
+ const struct file_operations *in_f_op, *out_f_op;
struct file * in_file, * out_file;
struct inode * in_inode, * out_inode;
+ int in_fops_idx, out_fops_idx;
loff_t pos;
ssize_t retval;
int fput_needed_in, fput_needed_out, fl;
@@ -803,13 +874,15 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
in_file = fget_light(in_fd, &fput_needed_in);
if (!in_file)
goto out;
+ in_fops_idx = fops_read_lock(in_file);
if (!(in_file->f_mode & FMODE_READ))
goto fput_in;
retval = -EINVAL;
in_inode = in_file->f_path.dentry->d_inode;
if (!in_inode)
goto fput_in;
- if (!in_file->f_op || !in_file->f_op->splice_read)
+ in_f_op = rcu_dereference(in_file->f_op);
+ if (!in_f_op || !in_f_op->splice_read)
goto fput_in;
retval = -ESPIPE;
if (!ppos)
@@ -829,10 +902,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
out_file = fget_light(out_fd, &fput_needed_out);
if (!out_file)
goto fput_in;
+ out_fops_idx = fops_read_lock(out_file);
if (!(out_file->f_mode & FMODE_WRITE))
goto fput_out;
retval = -EINVAL;
- if (!out_file->f_op || !out_file->f_op->sendpage)
+ out_f_op = rcu_dereference(out_file->f_op);
+ if (!out_f_op || !out_f_op->sendpage)
goto fput_out;
out_inode = out_file->f_path.dentry->d_inode;
retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
@@ -878,8 +953,10 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
retval = -EOVERFLOW;

fput_out:
+ fops_read_unlock(out_file, out_fops_idx);
fput_light(out_file, fput_needed_out);
fput_in:
+ fops_read_unlock(in_file, in_fops_idx);
fput_light(in_file, fput_needed_in);
out:
return retval;
diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..6017fa6 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,9 +21,16 @@

int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
- struct inode *inode = file->f_path.dentry->d_inode;
+ const struct file_operations *f_op;
+ struct inode *inode;
+ int fops_idx;
int res = -ENOTDIR;
- if (!file->f_op || !file->f_op->readdir)
+
+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
+ inode = file->f_path.dentry->d_inode;
+ if (!f_op || !f_op->readdir)
goto out;

res = security_file_permission(file, MAY_READ);
@@ -36,11 +43,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
- res = file->f_op->readdir(file, buf, filler);
+ res = f_op->readdir(file, buf, filler);
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
out:
+ fops_read_unlock(file, fops_idx);
return res;
}

diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..8f736a9 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -416,10 +416,12 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
continue;
file = fget_light(i, &fput_needed);
if (file) {
- f_op = file->f_op;
+ int fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
mask = DEFAULT_POLLMASK;
if (f_op && f_op->poll)
mask = (*f_op->poll)(file, retval ? NULL : wait);
+ fops_read_unlock(file, fops_idx);
fput_light(file, fput_needed);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
@@ -684,11 +686,20 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
file = fget_light(fd, &fput_needed);
mask = POLLNVAL;
if (file != NULL) {
+ const struct file_operations *f_op;
+ int fops_idx;
+
+ fops_idx = fops_read_lock(file);
+ f_op = rcu_dereference(file->f_op);
+
mask = DEFAULT_POLLMASK;
- if (file->f_op && file->f_op->poll)
- mask = file->f_op->poll(file, pwait);
+ if (f_op && f_op->poll)
+ mask = f_op->poll(file, pwait);
+
/* Mask out unneeded events. */
mask &= pollfd->events | POLLERR | POLLHUP;
+
+ fops_read_unlock(file, fops_idx);
fput_light(file, fput_needed);
}
}
--
1.6.1.2.350.g88cc

2009-04-11 12:12:22

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 7/9] vfs: Optimize fops_read_lock


After seeing fget_light and the justification for it in the commit
log I did not want to introduce something into the common read/write
path of file descriptors that could have a significant measurable impact
on I/O speed.

commit f6435db01709533f270b2dce1e5914770dbc65de
Author: akpm <akpm>
Date: Thu May 8 05:19:50 2003 +0000

[PATCH] reduced overheads in fget/fput

From: Dipankar Sarma <[email protected]>

fget() shows up on profiles, especially on SMP. Dipankar's patch
special-cases the situation wherein there are no sharers of current->files.

In this situation we know that no other process can close this file, so it
is not necessary to increment the file's refcount.

It's ugly as sin, but makes a substantial difference.

The test is

dd if=/dev/zero of=foo bs=1 count=1M

On 4CPU P3 xeon with 1MB L2 cache and 512MB ram:

kernel sys time std-dev
------------ -------- -------

UP - vanilla 2.104 0.028
UP - file 1.867 0.019

SMP - vanilla 2.976 0.023
SMP - file 2.719 0.026

BKrev: 3eb9e8f6Db0nMWoSx5IdHx6SBal8aw

My test case was:
dd if=/dev/zero of=/dev/null bs=1 count=1M.

As writing to a real file turned out to cover the cost.

Without the optimization I am seeing 2.4 - 2.5 MB/s on my
idle core2 single socket quad core.

With this optimization I am seeing 2.9 - 3.0 MB/s on the
same machine. Maybe 2% slower than before I introduced
fops_read_lock.

The common case is that there is only one thread and so
the fget_light optimization applies and f_count remains
at 1. Which implies that there is only a single process
performing operations through the file descriptor.

In that case because there is no possible contention
it is possible safely skip the atomic operations, gaining all
of the benefits of rcu without requiring a per cpu variable.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/file_table.c | 18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index d216557..634d44c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -495,15 +495,25 @@ void file_kill(struct file *file)
int fops_read_lock(struct file *file)
{
int revoked = (file->f_mode & FMODE_REVOKED);
- if (likely(!revoked))
- atomic_long_inc(&file->f_use);
+ if (likely(!revoked)) {
+ if (likely(atomic_long_read(&file->f_count) == 1))
+ atomic_long_set(&file->f_use,
+ atomic_long_read(&file->f_use) + 1);
+ else
+ atomic_long_inc(&file->f_use);
+ }
return revoked;
}

void fops_read_unlock(struct file *file, int revoked)
{
- if (likely(!revoked))
- atomic_long_dec(&file->f_use);
+ if (likely(!revoked)) {
+ if (likely(atomic_long_read(&file->f_count) == 1))
+ atomic_long_set(&file->f_use,
+ atomic_long_read(&file->f_use) - 1);
+ else
+ atomic_long_dec(&file->f_use);
+ }
}

int fs_may_remount_ro(struct super_block *sb)
--
1.6.1.2.350.g88cc

2009-04-11 12:13:40

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations


revoked_file_ops is a set of file operations designed to be used
when a files backing store has been removed.

revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
always ready for I/O and return -EIO from all other operations.

This is designed to allow userspace to gracefully file descriptors
that enter this unusable state.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/Makefile | 2 +-
fs/revoked_file.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 +
3 files changed, 184 insertions(+), 1 deletions(-)
create mode 100644 fs/revoked_file.c

diff --git a/fs/Makefile b/fs/Makefile
index af6d047..7787ddd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o fs_struct.o
+ stack.o fs_struct.o revoked_file.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/revoked_file.c b/fs/revoked_file.c
new file mode 100644
index 0000000..9936693
--- /dev/null
+++ b/fs/revoked_file.c
@@ -0,0 +1,181 @@
+/*
+ * linux/fs/revoked_file.c
+ *
+ * Copyright (C) 1997, Stephen Tweedie
+ *
+ * Provide stub functions for unreadable inodes
+ *
+ * Fabian Frederick : August 2003 - All file operations assigned to EIO
+ *
+ * Eric Biederman : 8 April 2008 - Derivied from bad_inode.c
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/stat.h>
+#include <linux/time.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+
+static loff_t revoked_file_llseek(struct file *file, loff_t offset, int origin)
+{
+ return -EIO;
+}
+
+static ssize_t revoked_file_read(struct file *filp, char __user *buf,
+ size_t size, loff_t *ppos)
+{
+ return 0;
+}
+
+static ssize_t revoked_file_write(struct file *filp, const char __user *buf,
+ size_t siz, loff_t *ppos)
+{
+ return -EIO;
+}
+
+static ssize_t revoked_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ return 0;
+}
+
+static ssize_t revoked_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ return -EIO;
+}
+
+static int revoked_file_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+ return -EIO;
+}
+
+static unsigned int revoked_file_poll(struct file *filp, poll_table *wait)
+{
+ return POLLIN | POLLOUT | POLLERR | POLLRDNORM | POLLWRNORM;
+}
+
+static int revoked_file_ioctl (struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ return -EIO;
+}
+
+static long revoked_file_unlocked_ioctl(struct file *file, unsigned cmd,
+ unsigned long arg)
+{
+ return -EIO;
+}
+
+static long revoked_file_compat_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ return -EIO;
+}
+
+static int revoked_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return -EIO;
+}
+
+static int revoked_file_open(struct inode *inode, struct file *filp)
+{
+ return -EIO;
+}
+
+static int revoked_file_flush(struct file *file, fl_owner_t id)
+{
+ return 0;
+}
+
+static int revoked_file_release(struct inode *inode, struct file *filp)
+{
+ return 0;
+}
+
+static int revoked_file_fsync(struct file *file, struct dentry *dentry,
+ int datasync)
+{
+ return -EIO;
+}
+
+static int revoked_file_aio_fsync(struct kiocb *iocb, int datasync)
+{
+ return -EIO;
+}
+
+static int revoked_file_fasync(int fd, struct file *filp, int on)
+{
+ return -EIO;
+}
+
+static int revoked_file_lock(struct file *file, int cmd, struct file_lock *fl)
+{
+ return -EIO;
+}
+
+static ssize_t revoked_file_sendpage(struct file *file, struct page *page,
+ int off, size_t len, loff_t *pos, int more)
+{
+ return -EIO;
+}
+
+static unsigned long revoked_file_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return -EIO;
+}
+
+static int revoked_file_check_flags(int flags)
+{
+ return -EIO;
+}
+
+static int revoked_file_flock(struct file *filp, int cmd, struct file_lock *fl)
+{
+ return -EIO;
+}
+
+static ssize_t revoked_file_splice_write(struct pipe_inode_info *pipe,
+ struct file *out, loff_t *ppos, size_t len,
+ unsigned int flags)
+{
+ return -EIO;
+}
+
+static ssize_t revoked_file_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
+{
+ return -EIO;
+}
+
+const struct file_operations revoked_file_ops =
+{
+ .llseek = revoked_file_llseek,
+ .read = revoked_file_read,
+ .write = revoked_file_write,
+ .aio_read = revoked_file_aio_read,
+ .aio_write = revoked_file_aio_write,
+ .readdir = revoked_file_readdir,
+ .poll = revoked_file_poll,
+ .ioctl = revoked_file_ioctl,
+ .unlocked_ioctl = revoked_file_unlocked_ioctl,
+ .compat_ioctl = revoked_file_compat_ioctl,
+ .mmap = revoked_file_mmap,
+ .open = revoked_file_open,
+ .flush = revoked_file_flush,
+ .release = revoked_file_release,
+ .fsync = revoked_file_fsync,
+ .aio_fsync = revoked_file_aio_fsync,
+ .fasync = revoked_file_fasync,
+ .lock = revoked_file_lock,
+ .sendpage = revoked_file_sendpage,
+ .get_unmapped_area = revoked_file_get_unmapped_area,
+ .check_flags = revoked_file_check_flags,
+ .flock = revoked_file_flock,
+ .splice_write = revoked_file_splice_write,
+ .splice_read = revoked_file_splice_read,
+};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a82a2ea..2fb0871 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -896,6 +896,8 @@ extern int fops_substitute(struct file *file, const struct file_operations *f_op
extern void inode_fops_substitute(struct inode *inode,
const struct file_operations *f_op, struct vm_operations_struct *vm_ops);

+extern const struct file_operations revoked_file_ops;
+
extern struct mutex files_lock;
#define file_list_lock() mutex_lock(&files_lock);
#define file_list_unlock() mutex_unlock(&files_lock);
--
1.6.1.2.350.g88cc

2009-04-11 12:15:14

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 9/9] proc: Use the generic vfs revoke facility that now exists.


Doing this the code becomes much simpler and more robust.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/generic.c | 100 +++++----------
fs/proc/inode.c | 339 +----------------------------------------------
fs/proc/internal.h | 2 +
fs/proc/root.c | 2 +-
include/linux/proc_fs.h | 4 -
5 files changed, 36 insertions(+), 411 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index fa678ab..5453114 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -20,6 +20,8 @@
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
+#include <linux/file.h>
+#include <linux/mm.h>
#include <asm/uaccess.h>

#include "internal.h"
@@ -37,7 +39,7 @@ static int proc_match(int len, const char *name, struct proc_dir_entry *de)
#define PROC_BLOCK_SIZE (PAGE_SIZE - 1024)

static ssize_t
-__proc_file_read(struct file *file, char __user *buf, size_t nbytes,
+proc_file_read(struct file *file, char __user *buf, size_t nbytes,
loff_t *ppos)
{
struct inode * inode = file->f_path.dentry->d_inode;
@@ -183,27 +185,6 @@ __proc_file_read(struct file *file, char __user *buf, size_t nbytes,
}

static ssize_t
-proc_file_read(struct file *file, char __user *buf, size_t nbytes,
- loff_t *ppos)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- ssize_t rv = -EIO;
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- spin_unlock(&pde->pde_unload_lock);
-
- rv = __proc_file_read(file, buf, nbytes, ppos);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static ssize_t
proc_file_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
@@ -211,17 +192,8 @@ proc_file_write(struct file *file, const char __user *buffer,
ssize_t rv = -EIO;

if (pde->write_proc) {
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- spin_unlock(&pde->pde_unload_lock);
-
/* FIXME: does this routine need ppos? probably... */
rv = pde->write_proc(file, buffer, count, pde->data);
- pde_users_dec(pde);
}
return rv;
}
@@ -630,10 +602,6 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
ent->mode = mode;
ent->nlink = nlink;
atomic_set(&ent->count, 1);
- ent->pde_users = 0;
- spin_lock_init(&ent->pde_unload_lock);
- ent->pde_unload_completion = NULL;
- INIT_LIST_HEAD(&ent->pde_openers);
out:
return ent;
}
@@ -777,6 +745,33 @@ void free_proc_entry(struct proc_dir_entry *de)
kfree(de);
}

+static struct inode *get_pde_inode(struct proc_dir_entry *de)
+{
+ struct inode *inode = NULL;
+ struct super_block *sb;
+
+ spin_lock(&sb_lock);
+ list_for_each_entry(sb, &proc_fs_type.fs_supers, s_instances) {
+ inode = ilookup(sb, de->low_ino);
+ if (inode && inode->i_fop != &revoked_file_ops)
+ break;
+ iput(inode);
+ inode = NULL;
+ }
+ spin_unlock(&sb_lock);
+ return inode;
+}
+
+static void proc_revoke_pde(struct proc_dir_entry *de)
+{
+ struct inode *inode;
+
+ while ((inode = get_pde_inode(de))) {
+ inode_fops_substitute(inode, &revoked_file_ops, &revoked_vm_ops);
+ iput(inode);
+ }
+}
+
/*
* Remove a /proc entry and free it if it's not currently in use.
*/
@@ -804,40 +799,7 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
if (!de)
return;

- spin_lock(&de->pde_unload_lock);
- /*
- * Stop accepting new callers into module. If you're
- * dynamically allocating ->proc_fops, save a pointer somewhere.
- */
- de->proc_fops = NULL;
- /* Wait until all existing callers into module are done. */
- if (de->pde_users > 0) {
- DECLARE_COMPLETION_ONSTACK(c);
-
- if (!de->pde_unload_completion)
- de->pde_unload_completion = &c;
-
- spin_unlock(&de->pde_unload_lock);
-
- wait_for_completion(de->pde_unload_completion);
-
- goto continue_removing;
- }
- spin_unlock(&de->pde_unload_lock);
-
-continue_removing:
- spin_lock(&de->pde_unload_lock);
- while (!list_empty(&de->pde_openers)) {
- struct pde_opener *pdeo;
-
- pdeo = list_first_entry(&de->pde_openers, struct pde_opener, lh);
- list_del(&pdeo->lh);
- spin_unlock(&de->pde_unload_lock);
- pdeo->release(pdeo->inode, pdeo->file);
- kfree(pdeo);
- spin_lock(&de->pde_unload_lock);
- }
- spin_unlock(&de->pde_unload_lock);
+ proc_revoke_pde(de);

if (S_ISDIR(de->mode))
parent->nlink--;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index d78ade3..aa7e629 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -117,330 +117,6 @@ static const struct super_operations proc_sops = {
.statfs = simple_statfs,
};

-static void __pde_users_dec(struct proc_dir_entry *pde)
-{
- pde->pde_users--;
- if (pde->pde_unload_completion && pde->pde_users == 0)
- complete(pde->pde_unload_completion);
-}
-
-void pde_users_dec(struct proc_dir_entry *pde)
-{
- spin_lock(&pde->pde_unload_lock);
- __pde_users_dec(pde);
- spin_unlock(&pde->pde_unload_lock);
-}
-
-static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- loff_t rv = -EINVAL;
- loff_t (*llseek)(struct file *, loff_t, int);
-
- spin_lock(&pde->pde_unload_lock);
- /*
- * remove_proc_entry() is going to delete PDE (as part of module
- * cleanup sequence). No new callers into module allowed.
- */
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- /*
- * Bump refcount so that remove_proc_entry will wail for ->llseek to
- * complete.
- */
- pde->pde_users++;
- /*
- * Save function pointer under lock, to protect against ->proc_fops
- * NULL'ifying right after ->pde_unload_lock is dropped.
- */
- llseek = pde->proc_fops->llseek;
- spin_unlock(&pde->pde_unload_lock);
-
- if (!llseek)
- llseek = default_llseek;
- rv = llseek(file, offset, whence);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static ssize_t proc_reg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- ssize_t rv = -EIO;
- ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- read = pde->proc_fops->read;
- spin_unlock(&pde->pde_unload_lock);
-
- if (read)
- rv = read(file, buf, count, ppos);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static ssize_t proc_reg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- ssize_t rv = -EIO;
- ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- write = pde->proc_fops->write;
- spin_unlock(&pde->pde_unload_lock);
-
- if (write)
- rv = write(file, buf, count, ppos);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static unsigned int proc_reg_poll(struct file *file, struct poll_table_struct *pts)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- unsigned int rv = DEFAULT_POLLMASK;
- unsigned int (*poll)(struct file *, struct poll_table_struct *);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- poll = pde->proc_fops->poll;
- spin_unlock(&pde->pde_unload_lock);
-
- if (poll)
- rv = poll(file, pts);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static long proc_reg_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- long rv = -ENOTTY;
- long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
- int (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- unlocked_ioctl = pde->proc_fops->unlocked_ioctl;
- ioctl = pde->proc_fops->ioctl;
- spin_unlock(&pde->pde_unload_lock);
-
- if (unlocked_ioctl) {
- rv = unlocked_ioctl(file, cmd, arg);
- if (rv == -ENOIOCTLCMD)
- rv = -EINVAL;
- } else if (ioctl) {
- lock_kernel();
- rv = ioctl(file->f_path.dentry->d_inode, file, cmd, arg);
- unlock_kernel();
- }
-
- pde_users_dec(pde);
- return rv;
-}
-
-#ifdef CONFIG_COMPAT
-static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- long rv = -ENOTTY;
- long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- compat_ioctl = pde->proc_fops->compat_ioctl;
- spin_unlock(&pde->pde_unload_lock);
-
- if (compat_ioctl)
- rv = compat_ioctl(file, cmd, arg);
-
- pde_users_dec(pde);
- return rv;
-}
-#endif
-
-static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma)
-{
- struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
- int rv = -EIO;
- int (*mmap)(struct file *, struct vm_area_struct *);
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- mmap = pde->proc_fops->mmap;
- spin_unlock(&pde->pde_unload_lock);
-
- if (mmap)
- rv = mmap(file, vma);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static int proc_reg_open(struct inode *inode, struct file *file)
-{
- struct proc_dir_entry *pde = PDE(inode);
- int rv = 0;
- int (*open)(struct inode *, struct file *);
- int (*release)(struct inode *, struct file *);
- struct pde_opener *pdeo;
-
- /*
- * What for, you ask? Well, we can have open, rmmod, remove_proc_entry
- * sequence. ->release won't be called because ->proc_fops will be
- * cleared. Depending on complexity of ->release, consequences vary.
- *
- * We can't wait for mercy when close will be done for real, it's
- * deadlockable: rmmod foo </proc/foo . So, we're going to do ->release
- * by hand in remove_proc_entry(). For this, save opener's credentials
- * for later.
- */
- pdeo = kmalloc(sizeof(struct pde_opener), GFP_KERNEL);
- if (!pdeo)
- return -ENOMEM;
-
- spin_lock(&pde->pde_unload_lock);
- if (!pde->proc_fops) {
- spin_unlock(&pde->pde_unload_lock);
- kfree(pdeo);
- return -EINVAL;
- }
- pde->pde_users++;
- open = pde->proc_fops->open;
- release = pde->proc_fops->release;
- spin_unlock(&pde->pde_unload_lock);
-
- if (open)
- rv = open(inode, file);
-
- spin_lock(&pde->pde_unload_lock);
- if (rv == 0 && release) {
- /* To know what to release. */
- pdeo->inode = inode;
- pdeo->file = file;
- /* Strictly for "too late" ->release in proc_reg_release(). */
- pdeo->release = release;
- list_add(&pdeo->lh, &pde->pde_openers);
- } else
- kfree(pdeo);
- __pde_users_dec(pde);
- spin_unlock(&pde->pde_unload_lock);
- return rv;
-}
-
-static struct pde_opener *find_pde_opener(struct proc_dir_entry *pde,
- struct inode *inode, struct file *file)
-{
- struct pde_opener *pdeo;
-
- list_for_each_entry(pdeo, &pde->pde_openers, lh) {
- if (pdeo->inode == inode && pdeo->file == file)
- return pdeo;
- }
- return NULL;
-}
-
-static int proc_reg_release(struct inode *inode, struct file *file)
-{
- struct proc_dir_entry *pde = PDE(inode);
- int rv = 0;
- int (*release)(struct inode *, struct file *);
- struct pde_opener *pdeo;
-
- spin_lock(&pde->pde_unload_lock);
- pdeo = find_pde_opener(pde, inode, file);
- if (!pde->proc_fops) {
- /*
- * Can't simply exit, __fput() will think that everything is OK,
- * and move on to freeing struct file. remove_proc_entry() will
- * find slacker in opener's list and will try to do non-trivial
- * things with struct file. Therefore, remove opener from list.
- *
- * But if opener is removed from list, who will ->release it?
- */
- if (pdeo) {
- list_del(&pdeo->lh);
- spin_unlock(&pde->pde_unload_lock);
- rv = pdeo->release(inode, file);
- kfree(pdeo);
- } else
- spin_unlock(&pde->pde_unload_lock);
- return rv;
- }
- pde->pde_users++;
- release = pde->proc_fops->release;
- if (pdeo) {
- list_del(&pdeo->lh);
- kfree(pdeo);
- }
- spin_unlock(&pde->pde_unload_lock);
-
- if (release)
- rv = release(inode, file);
-
- pde_users_dec(pde);
- return rv;
-}
-
-static const struct file_operations proc_reg_file_ops = {
- .llseek = proc_reg_llseek,
- .read = proc_reg_read,
- .write = proc_reg_write,
- .poll = proc_reg_poll,
- .unlocked_ioctl = proc_reg_unlocked_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = proc_reg_compat_ioctl,
-#endif
- .mmap = proc_reg_mmap,
- .open = proc_reg_open,
- .release = proc_reg_release,
-};
-
-#ifdef CONFIG_COMPAT
-static const struct file_operations proc_reg_file_ops_no_compat = {
- .llseek = proc_reg_llseek,
- .read = proc_reg_read,
- .write = proc_reg_write,
- .poll = proc_reg_poll,
- .unlocked_ioctl = proc_reg_unlocked_ioctl,
- .mmap = proc_reg_mmap,
- .open = proc_reg_open,
- .release = proc_reg_release,
-};
-#endif
-
struct inode *proc_get_inode(struct super_block *sb, unsigned int ino,
struct proc_dir_entry *de)
{
@@ -465,19 +141,8 @@ struct inode *proc_get_inode(struct super_block *sb, unsigned int ino,
inode->i_nlink = de->nlink;
if (de->proc_iops)
inode->i_op = de->proc_iops;
- if (de->proc_fops) {
- if (S_ISREG(inode->i_mode)) {
-#ifdef CONFIG_COMPAT
- if (!de->proc_fops->compat_ioctl)
- inode->i_fop =
- &proc_reg_file_ops_no_compat;
- else
-#endif
- inode->i_fop = &proc_reg_file_ops;
- } else {
- inode->i_fop = de->proc_fops;
- }
- }
+ if (de->proc_fops)
+ inode->i_fop = de->proc_fops;
unlock_new_inode(inode);
} else
de_put(de);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index f6db961..ea658ac 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -92,3 +92,5 @@ struct pde_opener {
struct list_head lh;
};
void pde_users_dec(struct proc_dir_entry *pde);
+
+extern struct file_system_type proc_fs_type;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 1e15a2b..ba7a99d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -96,7 +96,7 @@ static void proc_kill_sb(struct super_block *sb)
put_pid_ns(ns);
}

-static struct file_system_type proc_fs_type = {
+struct file_system_type proc_fs_type = {
.name = "proc",
.get_sb = proc_get_sb,
.kill_sb = proc_kill_sb,
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index fbfa3d4..2baeb37 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -72,10 +72,6 @@ struct proc_dir_entry {
read_proc_t *read_proc;
write_proc_t *write_proc;
atomic_t count; /* use count */
- int pde_users; /* number of callers into module in progress */
- spinlock_t pde_unload_lock; /* proc_fops checks and pde_users bumps */
- struct completion *pde_unload_completion;
- struct list_head pde_openers; /* who did ->open, but not ->release */
};

struct kcore_list {
--
1.6.1.2.350.g88cc

2009-04-11 15:59:22

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

On Sat, Apr 11, 2009 at 05:01:29AM -0700, Eric W. Biederman wrote:

> A couple of weeks ago I found myself looking at the uio, seeing that
> it does not support pci hot-unplug, and thinking "Great yet another
> implementation of hotunplug logic that needs to be added".
>
> I decided to see what it would take to add a generic implementation of
> the code we have for supporting hot unplugging devices in sysfs, proc,
> sysctl, tty_io, and now almost in the tun driver.
>
> Not long after I touched the tun driver and made it safe to delete the
> network device while still holding it's file descriptor open I someone
> else touch the code adding a different feature and my careful work
> went up in flames. Which brought home another point at the best of it
> this is ultimately complex tricky code that subsystems should not need
> to worry about.
>
> What makes this even more interesting is that in the presence of pci
> hot-unplug it looks like most subsystems and most devices will have to
> deal with the issue one way or another.

Ehh... The real mess is in things like "TTY in the middle of random
ioctl" and there's another pile that won't be solved on struct file
level - individual fs internals ;-/

> This infrastructure could also be used to implement sys_revoke and
> when I could not think of a better name I have drawn on that.

Yes, that's more or less obvious direction for revoke(), but there's a
problem with locking overhead that always scared me away from that.
Maybe I'm wrong, though... In any case, you want to carefully check
the overhead and cacheline bouncing implications for things like pipes
and sockets. Hell knows, maybe it'll work out, but...

Anyway, the really nasty part of revoke() (and true SAK, which is obviously
related) is handling of deep-inside-the-driver ioctls.

2009-04-11 16:49:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Al Viro <[email protected]> writes:

> On Sat, Apr 11, 2009 at 05:01:29AM -0700, Eric W. Biederman wrote:
>
>> A couple of weeks ago I found myself looking at the uio, seeing that
>> it does not support pci hot-unplug, and thinking "Great yet another
>> implementation of hotunplug logic that needs to be added".
>>
>> I decided to see what it would take to add a generic implementation of
>> the code we have for supporting hot unplugging devices in sysfs, proc,
>> sysctl, tty_io, and now almost in the tun driver.
>>
>> Not long after I touched the tun driver and made it safe to delete the
>> network device while still holding it's file descriptor open I someone
>> else touch the code adding a different feature and my careful work
>> went up in flames. Which brought home another point at the best of it
>> this is ultimately complex tricky code that subsystems should not need
>> to worry about.
>>
>> What makes this even more interesting is that in the presence of pci
>> hot-unplug it looks like most subsystems and most devices will have to
>> deal with the issue one way or another.
>
> Ehh... The real mess is in things like "TTY in the middle of random
> ioctl" and there's another pile that won't be solved on struct file
> level - individual fs internals ;-/

I haven't tackled code with a noticeable number of ioctls yet. But if
they are anything like what I have seen so far, a ref count to see
that you are in the still executing a function (so you don't pull the
rug out) from under it, and an additional method to say stop sleeping
and return should be sufficient.

>> This infrastructure could also be used to implement sys_revoke and
>> when I could not think of a better name I have drawn on that.
>
> Yes, that's more or less obvious direction for revoke(), but there's a
> problem with locking overhead that always scared me away from that.
> Maybe I'm wrong, though... In any case, you want to carefully check
> the overhead and cacheline bouncing implications for things like pipes
> and sockets. Hell knows, maybe it'll work out, but...

I took a careful look and I can't claim perfection at this stage but I
don't think there are any significant performance impacts from my
code. Further I am confident that if someone finds some performance
issues I will be able to understand and address them without a redesign.

While working on this I took a good hard look at the overhead I have
added to single byte reads and writes (operations that are dominated
by any possible overhead I am adding) and currently I am within 2% of
the case without my refcounting/locking.

I would be interested in anyone running micro benchmarks against my
patches and giving me feedback.

The fact that in the common case only one task ever accesses a struct
file leaves a lot of room for optimization.

> Anyway, the really nasty part of revoke() (and true SAK, which is obviously
> related) is handling of deep-inside-the-driver ioctls.

I doubt I have solved all of the problems. My goals are more modest
than a revoke that works for every possible file in the system. I
just want a common implementation of refcounting and blocking
unregistration code that can be used to solve the common problem I see
in sysfs, sysctl, proc, etc. I completely expect to need to modify
the code to take advantage of the infrastructure. Patch 9/9 has an
example of that, modifying proc so that it uses the infrastructure
I add and removing 400 lines of code.

I do think that what I have built once it is in use will make a good
foundation for building the rest of revoke. Mostly because I am solving
common problems once in a common way.

Eric

2009-04-11 16:57:17

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote:

> The fact that in the common case only one task ever accesses a struct
> file leaves a lot of room for optimization.

I'm not at all sure that it's a good assumption; even leaving aside e.g.
several tasks sharing stdout/stderr, a bunch of datagrams coming out of
several threads over the same socket is quite possible.

2009-04-11 23:57:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Al Viro <[email protected]> writes:

> On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote:
>
>> The fact that in the common case only one task ever accesses a struct
>> file leaves a lot of room for optimization.
>
> I'm not at all sure that it's a good assumption; even leaving aside e.g.
> several tasks sharing stdout/stderr, a bunch of datagrams coming out of
> several threads over the same socket is quite possible.

Maybe not. However those cases are already more expensive today.
Somewhere along the way we are already going to get cache line ping
pongs if there is real contention, and we are going to see the cost of
atomic operations. In which case the extra ref counting I am doing is
a little more expensive. And when I say a little more expensive I
mean 10-20ns per read/write more expensive.

At the same time if the common case really is applications not sharing
file descriptors (which seems sane) my current optimization easily
keeps the cost to practically nothing.

Using the srcu locking would also keep the cost down in the noise
because it guarantees non-shared cachelines and no expensive atomic
operations. srcu has the downside of requiring per cpu memory which
seems wrong to me somehow. However there are hybrid models like what
is used in mnt_want_write that are possible to limit the total amount
of per cpu memory while still getting the advantages.

Beyond that for correctness it looks like a pay me now or pay me later
situation. Do we track when we are in the methods for an object
generically where we can do the work once, and then concentrate on
enhancements. Or do we bog ourselves down using inferior
implementations that are replicated in varying ways from subsystem to
subsystem, and spend our time fighting the bugs in the subsystems?

I have the refcount/locking abstraction wrapped and have only to
perform the most basic of optimizations. So if we need to do something
more it should be easy.

Is performance your only concern with my patches?

Eric

2009-04-12 18:57:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
> always ready for I/O and return -EIO from all other operations.

I think read should return -EIO too. If a program is reading from a
/proc file (say), and the thing it's reading suddenly disappears, EOF
gives the false impression that it's read to the end of formatted data
from that file and it can process the data as if it's complete, which
is wrong.

-- Jamie

2009-04-12 20:04:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

Jamie Lokier <[email protected]> writes:

>> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
>> always ready for I/O and return -EIO from all other operations.
>
> I think read should return -EIO too. If a program is reading from a
> /proc file (say), and the thing it's reading suddenly disappears, EOF
> gives the false impression that it's read to the end of formatted data
> from that file and it can process the data as if it's complete, which
> is wrong.

Good point EIO is the current read return value for a removed proc file.

For closed pipes, and hung up ttys the read return value is 0, and from
my reading that is what bsd returns after a sys_revoke.

The reason I have f_op settable is because I never expected complete
agreement on the return codes, and because it makes auditing and spotting
this kind of thing easier.

I guess I should make two variations on revoked_file_ops then. Say
eof_file_ops, eio_file_ops. Identical except for their treatment of
reads.

Eric

2009-04-12 20:21:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Al Viro <[email protected]> writes:

> On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote:
>
>> The fact that in the common case only one task ever accesses a struct
>> file leaves a lot of room for optimization.
>
> I'm not at all sure that it's a good assumption; even leaving aside e.g.
> several tasks sharing stdout/stderr, a bunch of datagrams coming out of
> several threads over the same socket is quite possible.

I have thought about this a little more and a solution to ensure this is
not a problem for code that has not opted in to this new functionality is
simple. Require uses that need it to set FMODE_REVOKE.

It is no extra code and it keeps the absolute worst case behavior for
existing code down an additional branch mispredict.

It is worth doing anyway because it cleans up the abstraction and makes
it clear where revoke is supported.

Eric

2009-04-12 20:31:26

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

Eric W. Biederman wrote:
> >> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
> >> always ready for I/O and return -EIO from all other operations.
> >
> > I think read should return -EIO too. If a program is reading from a
> > /proc file (say), and the thing it's reading suddenly disappears, EOF
> > gives the false impression that it's read to the end of formatted data
> > from that file and it can process the data as if it's complete, which
> > is wrong.
>
> Good point EIO is the current read return value for a removed proc file.
>
> For closed pipes, and hung up ttys the read return value is 0, and from
> my reading that is what bsd returns after a sys_revoke.

A few suggestions below. Feel free to ignore them on account of the
basic revoking functionality being more important :-)

I'm not sure a revoked pipe should look like a normally closed one.
ECONNRESET?

For hung up ttys, I agree. But where's the SIGHUP :-) You probably do
want the process using it to die if it's not handling SIGHUP, because
terminal-using processes don't always terminate themselves on EOF.

For things writing to a pipe or file, SIGPIPE may be appropriate in
addition to EIO, to avoid runaway processes. Looks odd I know. For
writing to a terminal, SIGHUP again.

> The reason I have f_op settable is because I never expected complete
> agreement on the return codes, and because it makes auditing and spotting
> this kind of thing easier.
>
> I guess I should make two variations on revoked_file_ops then. Say
> eof_file_ops, eio_file_ops. Identical except for their treatment of
> reads.

Fair enough. It's good to have good defaults. I'm not convinced
eof_file_ops is ever a good default. sighup_file_ops and
sigpipe_file_ops maybe :-)

-- Jamie

2009-04-12 20:55:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

[email protected] (Eric W. Biederman) writes:

> Jamie Lokier <[email protected]> writes:
>
>>> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
>>> always ready for I/O and return -EIO from all other operations.
>>
>> I think read should return -EIO too. If a program is reading from a
>> /proc file (say), and the thing it's reading suddenly disappears, EOF
>> gives the false impression that it's read to the end of formatted data
>> from that file and it can process the data as if it's complete, which
>> is wrong.

I just thought about that some more and I am not convinced.

In general the current return values from proc after an I/O operation
are suspect. seek returns -EINVAL instead of -EIO. poll returns
DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced
that the existing proc return values on error are correct, and they
are recent additions so the historical precedent is not especially
large.

EOF does give the impression that you have read all of the data from
the /proc file, and that is in fact the case. There is no more
data coming from that proc file.

That the data is stale is well know.

That the data is not atomic, anything that spans more than a single
read is not atomic.

So I don't see what returning EIO adds to the equation. Perhaps
that your fragile user space string parser may break?

EOF gives a clear indication the application should stop reading
the data, because there is no more.

EIO only says that the was a problem.

I don't know of anything that depends on the rmmod behavior either
way. But if we can get away with it I would like to use something
that is generally useful instead of something that only makes
sense in the context of proc.

Eric

2009-04-12 21:03:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

Eric W. Biederman wrote:
> I just thought about that some more and I am not convinced.
>
> In general the current return values from proc after an I/O operation
> are suspect. seek returns -EINVAL instead of -EIO. poll returns
> DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced
> that the existing proc return values on error are correct, and they
> are recent additions so the historical precedent is not especially
> large.
>
> EOF does give the impression that you have read all of the data from
> the /proc file, and that is in fact the case. There is no more
> data coming from that proc file.
>
> That the data is stale is well know.
>
> That the data is not atomic, anything that spans more than a single
> read is not atomic.
>
> So I don't see what returning EIO adds to the equation. Perhaps
> that your fragile user space string parser may break?
>
> EOF gives a clear indication the application should stop reading
> the data, because there is no more.
>
> EIO only says that the was a problem.
>
> I don't know of anything that depends on the rmmod behavior either
> way. But if we can get away with it I would like to use something
> that is generally useful instead of something that only makes
> sense in the context of proc.

I'm not thinking of proc, really. More thinking of applications: EOF
effectively means "whole file read without error - now do the next thing".

If a filesystem file is revoked (umount -f), you definitely want to
stop that Makefile which is copying a file from the unmounted
filesystem to a target file. Otherwise you get inconsistent states
which can only occur as a result of this umount -f, something
Makefiles should never have to care about.

rmmod behaviour is not something any app should see normally.
Unexpected behaviour when files are oddly truncated (despite never
being written that way) is not "fragile user space". So whatever it
returns, it should be some error code, imho.

-- Jamie

2009-04-12 21:54:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

Jamie Lokier <[email protected]> writes:

> Eric W. Biederman wrote:
>> >> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is
>> >> always ready for I/O and return -EIO from all other operations.
>> >
>> > I think read should return -EIO too. If a program is reading from a
>> > /proc file (say), and the thing it's reading suddenly disappears, EOF
>> > gives the false impression that it's read to the end of formatted data
>> > from that file and it can process the data as if it's complete, which
>> > is wrong.
>>
>> Good point EIO is the current read return value for a removed proc file.
>>
>> For closed pipes, and hung up ttys the read return value is 0, and from
>> my reading that is what bsd returns after a sys_revoke.
>
> A few suggestions below. Feel free to ignore them on account of the
> basic revoking functionality being more important :-)

I think I will. This seems to be the part of the code that is easily
approachable and it is going to be easy to have different opinions on,
and there is no one right answer.

For now I'm just going to pick my best understanding of what BSD did.

Eric

2009-04-12 23:06:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations

Jamie Lokier <[email protected]> writes:

> Eric W. Biederman wrote:
>> I just thought about that some more and I am not convinced.
>>
>> In general the current return values from proc after an I/O operation
>> are suspect. seek returns -EINVAL instead of -EIO. poll returns
>> DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced
>> that the existing proc return values on error are correct, and they
>> are recent additions so the historical precedent is not especially
>> large.
>>
>> EOF does give the impression that you have read all of the data from
>> the /proc file, and that is in fact the case. There is no more
>> data coming from that proc file.
>>
>> That the data is stale is well know.
>>
>> That the data is not atomic, anything that spans more than a single
>> read is not atomic.
>>
>> So I don't see what returning EIO adds to the equation. Perhaps
>> that your fragile user space string parser may break?
>>
>> EOF gives a clear indication the application should stop reading
>> the data, because there is no more.
>>
>> EIO only says that the was a problem.
>>
>> I don't know of anything that depends on the rmmod behavior either
>> way. But if we can get away with it I would like to use something
>> that is generally useful instead of something that only makes
>> sense in the context of proc.
>
> I'm not thinking of proc, really. More thinking of applications: EOF
> effectively means "whole file read without error - now do the next thing".
>
> If a filesystem file is revoked (umount -f), you definitely want to
> stop that Makefile which is copying a file from the unmounted
> filesystem to a target file. Otherwise you get inconsistent states
> which can only occur as a result of this umount -f, something
> Makefiles should never have to care about.
>
> rmmod behaviour is not something any app should see normally.
> Unexpected behaviour when files are oddly truncated (despite never
> being written that way) is not "fragile user space". So whatever it
> returns, it should be some error code, imho.

Well I just took a look at NetBSD 4.0.1 and it appears they agree with
you.

Plus I'm starting to feel a lot better about the linux manual pages,
as the revoke(2) man pages from the BSDs describe different error
codes than the implementation, and they fail to mention revoke appears
to work on ordinary files as well.

If the file is not a tty EIO is returned from read.

opens return ENXIO
writes return EIO
ioctl returns EBADF
close returns 0

Operations that just lookup the vnode simply return EBADF.

I don't know if that is perfectly correct for the linux case. EBADF
usually means the file descriptor specified isn't open.

Eric

2009-04-14 03:17:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Hello, Eric.

Eric W. Biederman wrote:
> A couple of weeks ago I found myself looking at the uio, seeing that
> it does not support pci hot-unplug, and thinking "Great yet another
> implementation of hotunplug logic that needs to be added".
>
> I decided to see what it would take to add a generic implementation of
> the code we have for supporting hot unplugging devices in sysfs, proc,
> sysctl, tty_io, and now almost in the tun driver.
>
> Not long after I touched the tun driver and made it safe to delete the
> network device while still holding it's file descriptor open I someone
> else touch the code adding a different feature and my careful work
> went up in flames. Which brought home another point at the best of it
> this is ultimately complex tricky code that subsystems should not need
> to worry about.

I like the way it's headed. I'm trying to add similar 'revoke' or
'sever' mechanism at block and char device layers so that low level
drivers don't have to worry about object lifetimes and so on. Doing
it at the file layer makes sense and can probably replace whatever
mechanism at the chardev.

The biggest obstacle was the extra in-use reference count overhead. I
thought it could be solved by implementing generic percpu reference
count similar to the one used for module reference counting. Hot path
overhead could be reduced to local_t cmpxchg (w/o LOCK prefix) on
per-cpu variable + one branch, which was pretty good. The problem was
that space and access overhead for dynamic per-cpu variables wasn't
too good, so I started working on dynamic percpu allocator.

The dynamic per-cpu allocator is pretty close to completion. Only
several archs need to be converted and it's likely to happen during
next few months. The plan after that was 1. add per-cpu local_t
accessors (might replace local_t completely) 2. add generic per-cpu
reference counter and move module reference counting to it
3. implement block/chardev sever (or revoke) support.

I think #3 can be merged with what you're working on. What do you
think?

Thanks.

--
tejun

2009-04-14 07:39:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Tejun Heo <[email protected]> writes:

> Hello, Eric.
>
> Eric W. Biederman wrote:
>> A couple of weeks ago I found myself looking at the uio, seeing that
>> it does not support pci hot-unplug, and thinking "Great yet another
>> implementation of hotunplug logic that needs to be added".
>>
>> I decided to see what it would take to add a generic implementation of
>> the code we have for supporting hot unplugging devices in sysfs, proc,
>> sysctl, tty_io, and now almost in the tun driver.
>>
>> Not long after I touched the tun driver and made it safe to delete the
>> network device while still holding it's file descriptor open I someone
>> else touch the code adding a different feature and my careful work
>> went up in flames. Which brought home another point at the best of it
>> this is ultimately complex tricky code that subsystems should not need
>> to worry about.
>
> I like the way it's headed. I'm trying to add similar 'revoke' or
> 'sever' mechanism at block and char device layers so that low level
> drivers don't have to worry about object lifetimes and so on. Doing
> it at the file layer makes sense and can probably replace whatever
> mechanism at the chardev.
>
> The biggest obstacle was the extra in-use reference count overhead. I
> thought it could be solved by implementing generic percpu reference
> count similar to the one used for module reference counting. Hot path
> overhead could be reduced to local_t cmpxchg (w/o LOCK prefix) on
> per-cpu variable + one branch, which was pretty good. The problem was
> that space and access overhead for dynamic per-cpu variables wasn't
> too good, so I started working on dynamic percpu allocator.
>
> The dynamic per-cpu allocator is pretty close to completion. Only
> several archs need to be converted and it's likely to happen during
> next few months. The plan after that was 1. add per-cpu local_t
> accessors (might replace local_t completely) 2. add generic per-cpu
> reference counter and move module reference counting to it
> 3. implement block/chardev sever (or revoke) support.
>
> I think #3 can be merged with what you're working on. What do you
> think?

Sounds reasonable.

Do you know of a case where we actually have multiple tasks accessing
a file simultaneously?

I just instrumented up my patch an so far the only case I have found
are multiple processes closing the same file. Some weird part of
bash forking extra processes.

Eric

2009-04-14 07:47:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Eric W. Biederman wrote:
> Do you know of a case where we actually have multiple tasks accessing
> a file simultaneously?

I don't have anything at hand but multithread/process server accepting
on the same socket comes to mind. I don't think it would be a very
rare thing. If you confine the scope to character devices or sysfs,
it could be quite rare tho.

> I just instrumented up my patch an so far the only case I have found
> are multiple processes closing the same file. Some weird part of
> bash forking extra processes.

Hmmm...

Thanks.

--
tejun

2009-04-14 08:28:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Tejun Heo <[email protected]> writes:

> Eric W. Biederman wrote:
>> Do you know of a case where we actually have multiple tasks accessing
>> a file simultaneously?
>
> I don't have anything at hand but multithread/process server accepting
> on the same socket comes to mind. I don't think it would be a very
> rare thing. If you confine the scope to character devices or sysfs,
> it could be quite rare tho.

Yes. I think I can safely exclude sockets, and not bother with
reference counting them.

The only strong evidence I have that multi-threading on a single file
descriptor is likely to be common is that we have pread and pwrite
syscalls. At the same time the number of races we have in struct file
if it is accessed by multiple threads at the same time, suggests
that at least for cases where you have an offset it doesn't happen often.

I cringe when I see per cpu counters for something like files that we
are likely to have a lot of. I keep imagining a quadratic explosion
in data size. In practice we are likely to have a small cpu count <=
8-16 cpus so it is likely ok. Especially if we are only allocating 8
bytes per cpu per file. I guess in total that is at most 128K per file.
8bytes*16k cpus. With the default system file-max on my systems 203871
to 705863, it looks like we would max out at between 1M and 5M per cpu.
Still a lot but survivable.

Somewhere it all falls down, but only if you max out a very rare
very large machine, and that seems to be case with just about everything.

Which all leads me to say that if we can avoid per cpu memory and not impact
performance I want to do that.

Eric

2009-04-14 08:51:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Hello, Eric.

Eric W. Biederman wrote:
> Tejun Heo <[email protected]> writes:
>> Eric W. Biederman wrote:
>>> Do you know of a case where we actually have multiple tasks accessing
>>> a file simultaneously?
>> I don't have anything at hand but multithread/process server accepting
>> on the same socket comes to mind. I don't think it would be a very
>> rare thing. If you confine the scope to character devices or sysfs,
>> it could be quite rare tho.
>
> Yes. I think I can safely exclude sockets, and not bother with
> reference counting them.
>
> The only strong evidence I have that multi-threading on a single file
> descriptor is likely to be common is that we have pread and pwrite
> syscalls. At the same time the number of races we have in struct file
> if it is accessed by multiple threads at the same time, suggests
> that at least for cases where you have an offset it doesn't happen often.
>
> I cringe when I see per cpu counters for something like files that we
> are likely to have a lot of. I keep imagining a quadratic explosion
> in data size. In practice we are likely to have a small cpu count <=
> 8-16 cpus so it is likely ok. Especially if we are only allocating 8
> bytes per cpu per file. I guess in total that is at most 128K per file.
> 8bytes*16k cpus. With the default system file-max on my systems 203871
> to 705863, it looks like we would max out at between 1M and 5M per cpu.
> Still a lot but survivable.

Not only that percpu refcnt is quite expensive to shut down too. For
modules and devices, it doesn't really matter but using it for files
on FS would be pretty scary.

> Somewhere it all falls down, but only if you max out a very rare
> very large machine, and that seems to be case with just about everything.
>
> Which all leads me to say that if we can avoid per cpu memory and not impact
> performance I want to do that.

Yeah, fully agreed there.

--
tejun

2009-04-14 15:08:14

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Eric W. Biederman wrote:
> > I don't have anything at hand but multithread/process server accepting
> > on the same socket comes to mind. I don't think it would be a very
> > rare thing. If you confine the scope to character devices or sysfs,
> > it could be quite rare tho.
>
> Yes. I think I can safely exclude sockets, and not bother with
> reference counting them.

Good idea. As well as many processes calling accept(), it's not
unusual to have two threads or processes for reading and writing
concurrently to TCP sockets, and to have a single UDP socket shared
among threads/processes for sendto.

> The only strong evidence I have that multi-threading on a single file
> descriptor is likely to be common is that we have pread and pwrite
> syscalls. At the same time the number of races we have in struct file
> if it is accessed by multiple threads at the same time, suggests
> that at least for cases where you have an offset it doesn't happen often.

Notice the preadv and pwritev syscalls added recently? They were
added because QEMU and KVM need them for performance. Those programs
have multiple threads doing I/O to the same file concurrently. It's
like a poor man's AIO, except it's more reliable than real Linux AIO :-)

Databases probably should use concurrent p{read,write}{,v} if they're
not using direct I/O and AIO. I'm not sure if the well-known
databases do. In the past there have been some poor quality
"emulations" of those syscalls prone to races, on Linux and BSD I believe.

What are the races you've noticed?

-- Jamie

2009-04-14 19:10:08

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support

Jamie Lokier <[email protected]> writes:

> Eric W. Biederman wrote:
>> > I don't have anything at hand but multithread/process server accepting
>> > on the same socket comes to mind. I don't think it would be a very
>> > rare thing. If you confine the scope to character devices or sysfs,
>> > it could be quite rare tho.
>>
>> Yes. I think I can safely exclude sockets, and not bother with
>> reference counting them.
>
> Good idea. As well as many processes calling accept(), it's not
> unusual to have two threads or processes for reading and writing
> concurrently to TCP sockets, and to have a single UDP socket shared
> among threads/processes for sendto.

I have been playing with what I can see when I instrument up my code.

The first thing that popped up was that we have a lots of reads/writes
to files with f_count > 1. Which defeats my micro optimization in
fops_read_lock. So in those cases I still have to pay the full cost
of an atomic even if I have an exclusive cache line.

I have found that for make -j N I tend to get N processes all
reading from the same pipe at the same time. Not a smoking
gun that my assumption that only one process will be using
a file descriptor at a time in performance paths but it certainly
shows that things are nowhere near as rare as I thought.

The good news is that I have found a much better/cheaper optimization.
Instead of per cpu or per file memory, use per task memory. It is
always uncontended, and a task appears to never use more than two files
simultaneously (stacking?).

I have just prototyped that and things are looking very promising.
Now I just need to clean everything up and resend my patches.

>> The only strong evidence I have that multi-threading on a single file
>> descriptor is likely to be common is that we have pread and pwrite
>> syscalls. At the same time the number of races we have in struct file
>> if it is accessed by multiple threads at the same time, suggests
>> that at least for cases where you have an offset it doesn't happen often.
>
> Notice the preadv and pwritev syscalls added recently? They were
> added because QEMU and KVM need them for performance. Those programs
> have multiple threads doing I/O to the same file concurrently. It's
> like a poor man's AIO, except it's more reliable than real Linux AIO :-)
>
> Databases probably should use concurrent p{read,write}{,v} if they're
> not using direct I/O and AIO. I'm not sure if the well-known
> databases do. In the past there have been some poor quality
> "emulations" of those syscalls prone to races, on Linux and BSD I believe.
>
> What are the races you've noticed?

Besides the f_pos (which pread variants handle) there is no locking on
the file read ahead state, and f_flags only got locking a month or two
ago.


Eric

2009-04-14 22:12:56

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file

Hi, Eric,

One little thing I noticed as I was looking at this...

> +int fops_substitute(struct file *file, const struct file_operations *f_op,
> + struct vm_operations_struct *vm_ops)
> +{

[...]

> + /*
> + * Wait until there are no more callers in the original
> + * file_operations methods.
> + */
> + while (atomic_long_read(&file->f_use) > 0)
> + schedule_timeout_interruptible(1);

You use an interruptible sleep here, but there's no signal check to get you
out of the loop. So it's not really interruptible. If f_use never goes to
zero (a distressingly likely possibility, I fear), this code will create
the equivalent of an unkillable D-wait state without ever actually showing
up that way in "ps".

Actually, now that I look, once you've got a signal pending you'll stay
in TASK_RUNNING, so the above could turn into a busy-wait.

Unless I've missed something...?

I have no idea what the right thing to do in the face of a signal would
be. Perhaps the wait-for-zero and release() call stuff should be dumped
into a workqueue and done asynchronously? OTOH, I can see a need to know
when the revoke operation is really done...

jon

2009-04-15 02:55:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file

Jonathan Corbet <[email protected]> writes:

> Hi, Eric,
>
> One little thing I noticed as I was looking at this...
>
>> +int fops_substitute(struct file *file, const struct file_operations *f_op,
>> + struct vm_operations_struct *vm_ops)
>> +{
>
> [...]
>
>> + /*
>> + * Wait until there are no more callers in the original
>> + * file_operations methods.
>> + */
>> + while (atomic_long_read(&file->f_use) > 0)
>> + schedule_timeout_interruptible(1);
>
> You use an interruptible sleep here, but there's no signal check to get you
> out of the loop. So it's not really interruptible. If f_use never goes to
> zero (a distressingly likely possibility, I fear), this code will create
> the equivalent of an unkillable D-wait state without ever actually showing
> up that way in "ps".

I snagged this idiom out of srcu and hadn't given it much thought.

We have a number of places in the kernel where we aren't performing work
for user space where we fib about the kind of sleep we are doing, so
we don't increase the load. In this case we are in fs code so I guess
calling this an uninterruptible sleep is fair, especially since it
looks like at some point this code path is going be called from
a syscall.

As for f_use not going to zero, we have strong progress guarantees:

- fops_read_lock at that point will not increment the count of any new
users of the file.

- There is an additional awaken_all_waiters to wake up any wait queues
that are causing syscalls to block in the kernel.


> Actually, now that I look, once you've got a signal pending you'll stay
> in TASK_RUNNING, so the above could turn into a busy-wait.
>
> Unless I've missed something...?

Well we will always schedule, so it shouldn't be a pure busy-wait,
but overall I would call this a good catch.

> I have no idea what the right thing to do in the face of a signal would
> be. Perhaps the wait-for-zero and release() call stuff should be dumped
> into a workqueue and done asynchronously? OTOH, I can see a need to know
> when the revoke operation is really done...

Yes.

For sys_revoke the wait doesn't appear necessary.

For umount -f, rmmod, or pci hotunplug we need the wait to know when we it
is safe to free up underlying data structures. And at least for the latter
two being truly interruptible is a correctness problem.

Eric

2009-06-01 21:45:34

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/23] File descriptor hot-unplug support v2


I found myself looking at the uio, seeing that it does not support pci
hot-unplug, and thinking "Great yet another implementation of
hotunplug logic that needs to be added".

I decided to see what it would take to add a generic implementation of
the code we have for supporting hot unplugging devices in sysfs, proc,
sysctl, tty_io, and now almost in the tun driver.

Not long after I touched the tun driver and made it safe to delete the
network device while still holding it's file descriptor open I someone
else touch the code adding a different feature and my careful work
went up in flames. Which brought home another point at the best of it
this is ultimately complex tricky code that subsystems should not need
to worry about.

What makes this even more interesting is that in the presence of pci
hot-unplug it looks like most subsystems and most devices will have to
deal with the issue one way or another.

This infrastructure could also be used to implement both force
unmounts and sys_revoke. When I could not think of a better name for
I have drawn on that and used revoke.

The following changes draw on and generalize the work in tty_io sysfs,
proc, and sysctl and move it into the vfs level. Where the basic
primitives are running faster, and the solution is more general.


... Changes since version 1.

All of that lead to the first version of this patchset. The feedback
I got from that was generally positive but there was a concern about
performance when two there are two simultaneous accessors to the tty
at the same time.

After looking into the performance concerns of what happens when
multiple programs access the same struct file and finding that I could
not rule out a performance regression I have gone back and redesigned
my mutual exclusion primitive creating something simpler and faster.

I have also changed my synchronization primitives extending them to
protect most of what is read-only in struct file today and abandoning
rcu-ness of struct file.

Giving up rcu-ness leads to true exclusion and makes the code much
easier to think about.

In this patchset is the basic code patchs 1-4 and a conversion of
the vfs except for the nfsd entry points. Enough for a reasonable
result.

These patches are based on Al's vfs/for-next tree.

The vfs changes in this patchset.

Documentation/filesystems/vfs.txt | 5 +
drivers/char/pty.c | 2 +-
drivers/char/tty_io.c | 22 ++--
fs/Kconfig | 4 +
fs/aio.c | 51 +++++--
fs/compat.c | 16 ++-
fs/compat_ioctl.c | 14 ++-
fs/eventpoll.c | 41 +++++-
fs/fcntl.c | 28 +++--
fs/file_table.c | 281 +++++++++++++++++++++++++++++--------
fs/inode.c | 1 +
fs/ioctl.c | 8 +-
fs/locks.c | 8 +-
fs/namei.c | 11 ++-
fs/open.c | 81 +++++++++--
fs/proc/base.c | 29 ++--
fs/read_write.c | 122 ++++++++++++----
fs/readdir.c | 20 ++-
fs/select.c | 53 ++++++-
fs/splice.c | 111 ++++++++++-----
fs/super.c | 1 -
fs/sync.c | 9 +-
include/linux/fs.h | 49 ++++++-
include/linux/mm.h | 2 +
include/linux/poll.h | 3 +
include/linux/sched.h | 7 +
include/linux/tty.h | 2 +-
mm/fadvise.c | 7 +
mm/filemap.c | 25 ++--
mm/memory.c | 98 +++++++++++++
mm/mmap.c | 78 +++++++----
mm/nommu.c | 21 +++-
security/selinux/hooks.c | 8 +-
33 files changed, 950 insertions(+), 268 deletions(-)

The necessary changes to proc to take advantage of this functionality.

fs/proc/Kconfig | 1 +
fs/proc/generic.c | 56 +++-----
fs/proc/inode.c | 354 ++++-------------------------------------------
fs/proc/internal.h | 1 +
include/linux/proc_fs.h | 4 -
5 files changed, 44 insertions(+), 372 deletions(-)

2009-06-01 21:51:21

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 01/23] mm: Introduce revoke_file_mappings.

From: Eric W. Biederman <[email protected]>

When the backing store of a file becomes inaccessible we need a function
to remove that file from the page tables and arrange for page faults
to receive SIGBUS until the file is unmapped.

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/mm.h | 2 +
mm/memory.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..5d7480d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -808,6 +808,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);

+extern void revoke_file_mappings(struct file *file);
+
#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..5cbee3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@
#include <linux/kallsyms.h>
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/file.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -2358,6 +2359,103 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static struct vm_operations_struct revoked_vm_ops = {
+ .fault = revoked_fault,
+};
+
+static void revoke_vma(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ unsigned long start_addr, end_addr, size;
+ struct mm_struct *mm;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+
+ /* Switch out the locks so I can maninuplate this under the mm sem.
+ * Needed so I can call vm_ops->close.
+ */
+ mm = vma->vm_mm;
+ atomic_inc(&mm->mm_users);
+ spin_unlock(&mapping->i_mmap_lock);
+
+ /* Block page faults and other code modifying the mm. */
+ down_write(&mm->mmap_sem);
+
+ /* Lookup a vma for my file address */
+ vma = find_vma(mm, start_addr);
+ if (vma->vm_file != file)
+ goto out;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+ size = end_addr - start_addr;
+
+ /* Unlock the pages */
+ if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
+ mm->locked_vm -= vma_pages(vma);
+ vma->vm_flags &= ~VM_LOCKED;
+ }
+
+ /* Unmap the vma */
+ zap_page_range(vma, start_addr, size, NULL);
+
+ /* Unlink the vma from the file */
+ unlink_file_vma(vma);
+
+ /* Close the vma */
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ removed_exe_file_vma(vma->vm_mm);
+
+ /* Repurpose the vma */
+ vma->vm_private_data = NULL;
+ vma->vm_ops = &revoked_vm_ops;
+ vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
+out:
+ up_write(&mm->mmap_sem);
+ spin_lock(&mapping->i_mmap_lock);
+}
+
+void revoke_file_mappings(struct file *file)
+{
+ /* After a file has been marked dead update the vmas */
+ struct address_space *mapping = file->f_mapping;
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+
+ spin_lock(&mapping->i_mmap_lock);
+
+restart_tree:
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ revoke_vma(vma);
+ goto restart_tree;
+ }
+
+restart_list:
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ revoke_vma(vma);
+ goto restart_list;
+ }
+
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:51:46

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 02/23] vfs: Implement unpoll_file.

From: Eric W. Biederman <[email protected]>

During a revoke operation it is necessary to stop using all state that is managed
by the underlying file operations implementation. The poll wait queue is one part
of that state.

unpoll_file achieves that by walking through a specified waitqueue. Finding
any entries that were added by select or poll of that file descriptor and
awakening them. If action was taken unpoll sleeps and repeats until
the waitqueue has no entries for the spcified file.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/select.c | 31 +++++++++++++++++++++++++++++++
include/linux/poll.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..bd30fe8 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -941,3 +941,34 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
return ret;
}
#endif /* HAVE_SET_RESTORE_SIGMASK */
+
+#ifdef CONFIG_FILE_HOTPLUG
+static int unpoll_file_once(wait_queue_head_t *q, struct file *file)
+{
+ unsigned long flags;
+ wait_queue_t *curr, *next;
+ int found = 0;
+
+ spin_lock_irqsave(&q->lock, flags);
+ list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
+ struct poll_table_entry *entry;
+ if (curr->func != pollwake)
+ continue;
+ entry = container_of(curr, struct poll_table_entry, wait);
+ if (entry->filp != file)
+ continue;
+ curr->func(curr, TASK_NORMAL, 0, NULL);
+ found = 1;
+ }
+ spin_unlock_irqrestore(&q->lock, flags);
+
+ return found;
+}
+
+void unpoll_file(wait_queue_head_t *q, struct file *file)
+{
+ while (unpoll_file_once(q, file))
+ schedule_timeout_uninterruptible(1);
+}
+EXPORT_SYMBOL(unpoll_file);
+#endif
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..d388620 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -131,6 +131,8 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);

+extern void unpoll_file(wait_queue_head_t *q, struct file *file);
+
#endif /* KERNEL */

#endif /* _LINUX_POLL_H */
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:52:17

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 05/23] vfs: Teach lseek to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/read_write.c | 24 +++++++++++++++++-------
1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..c9511ce 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -136,14 +136,24 @@ EXPORT_SYMBOL(default_llseek);
loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
{
loff_t (*fn)(struct file *, loff_t, int);
+ loff_t retval = -ESPIPE;

- fn = no_llseek;
- if (file->f_mode & FMODE_LSEEK) {
- fn = default_llseek;
- if (file->f_op && file->f_op->llseek)
- fn = file->f_op->llseek;
- }
- return fn(file, offset, origin);
+ if (!(file->f_mode & FMODE_LSEEK))
+ goto out;
+
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ fn = default_llseek;
+ if (file->f_op && file->f_op->llseek)
+ fn = file->f_op->llseek;
+
+ retval = fn(file, offset, origin);
+
+ file_hotplug_read_unlock(file);
+out:
+ return retval;
}
EXPORT_SYMBOL(vfs_llseek);

--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:52:42

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 06/23] vfs: Teach read/write to use file_hotplug_read_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/compat.c | 16 +++++++++++-
fs/read_write.c | 70 +++++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 25be41c..dad9957 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1196,12 +1196,18 @@ static size_t compat_readv(struct file *file,
if (!(file->f_mode & FMODE_READ))
goto out;

+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
- goto out;
+ goto out_unlock;

ret = compat_do_readv_writev(READ, file, vec, vlen, pos);

+out_unlock:
+ file_hotplug_read_unlock(file);
out:
if (ret > 0)
add_rchar(current, ret);
@@ -1253,12 +1259,18 @@ static size_t compat_writev(struct file *file,
if (!(file->f_mode & FMODE_WRITE))
goto out;

+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
- goto out;
+ goto out_unlock;

ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);

+out_unlock:
+ file_hotplug_read_unlock(file);
out:
if (ret > 0)
add_wchar(current, ret);
diff --git a/fs/read_write.c b/fs/read_write.c
index c9511ce..718baea 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -288,12 +288,18 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;

+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
+ goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
- return -EINVAL;
+ goto out_unlock;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
- return -EFAULT;
+ goto out_unlock;

ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
@@ -309,6 +315,9 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
inc_syscr(current);
}

+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}

@@ -343,12 +352,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
{
ssize_t ret;

+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
+ goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
- return -EINVAL;
+ goto out_unlock;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
- return -EFAULT;
+ goto out_unlock;

ret = rw_verify_area(WRITE, file, pos, count);
if (ret >= 0) {
@@ -364,6 +379,9 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
inc_syscw(current);
}

+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}

@@ -676,12 +694,26 @@ out:
ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ ssize_t ret;
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
+ goto out;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
- return -EINVAL;
+ goto out_unlock;
+
+ ret = do_readv_writev(READ, file, vec, vlen, pos);

- return do_readv_writev(READ, file, vec, vlen, pos);
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}

EXPORT_SYMBOL(vfs_readv);
@@ -689,12 +721,26 @@ EXPORT_SYMBOL(vfs_readv);
ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ ssize_t ret;
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
+ goto out;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
- return -EINVAL;
+ goto out_unlock;

- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = do_readv_writev(WRITE, file, vec, vlen, pos);
+
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}

EXPORT_SYMBOL(vfs_writev);
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:53:45

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 08/23] vfs: Teach readdir to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/readdir.c | 20 +++++++++++++++-----
1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..2e147cf 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,18 +21,26 @@

int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
- struct inode *inode = file->f_path.dentry->d_inode;
- int res = -ENOTDIR;
- if (!file->f_op || !file->f_op->readdir)
+ struct inode *inode;
+ int res;
+
+ res = -EIO;
+ if (!file_hotplug_read_trylock(file))
goto out;

+ inode = file->f_path.dentry->d_inode;
+
+ res = -ENOTDIR;
+ if (!file->f_op || !file->f_op->readdir)
+ goto out_unlock;
+
res = security_file_permission(file, MAY_READ);
if (res)
- goto out;
+ goto out_unlock;

res = mutex_lock_killable(&inode->i_mutex);
if (res)
- goto out;
+ goto out_unlock;

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
@@ -40,6 +48,8 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
+out_unlock:
+ file_hotplug_read_unlock(file);
out:
return res;
}
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:53:16

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 03/23] vfs: Generalize the file_list

From: Eric W. Biederman <[email protected]>

In the implementation of revoke it is desirable to find all of the
files we want to operation on. Currently tty's and mark_files_ro
use the file_list for this, and this patch generalizes the file
list so it can be used more efficiently.

This patch starts by introducing struct file_list making the file
list a first class object. file_list_lock and file_list_unlock
are modified to take this object, making it clear which file_list
we intended to lock.

file_move is transformed into file_list_add taking a file_list and not
allowing the movement of one file to another. __dentry_open
is modified to support this by only adding normal files in open,
special files have always been ignored when walking the file_list.
__dentry_open skipping special files allows __ptmx_open and __tty_open
to safely call file_add as they are adding the file to the file_list
for the first time.

file_kill has been renamed file_list_del to make it clear what it is
doing and to keep from confusing it with a more revoke like operation.

put_filp has been modified to not take file_list_del as we are never
on a file_list when put_filp is called.

fs_may_remount_ro and mark_files_ro have been modified to walk the
inode list to find all of the inodes and then to walk the file list
on those inodes. It can be a slightly longer walk as we frequently
cache inodes that we do not have open but the overall complexity
should be about the same, these are slow path functions, and it
gives us much greater flexibility overall.

Signed-off-by: Eric W. Biederman <[email protected]>
---
drivers/char/pty.c | 2 +-
drivers/char/tty_io.c | 22 ++++-----
fs/file_table.c | 117 +++++++++++++++++++++++++---------------------
fs/inode.c | 1 +
fs/open.c | 6 ++-
fs/select.c | 2 -
fs/super.c | 1 -
include/linux/fs.h | 24 +++++++--
include/linux/tty.h | 2 +-
security/selinux/hooks.c | 8 ++--
10 files changed, 102 insertions(+), 83 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index 31038a0..585f700 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -662,7 +662,7 @@ static int __ptmx_open(struct inode *inode, struct file *filp)

set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_list_add(filp, &tty->tty_files);

retval = devpts_pty_new(inode, tty->link);
if (retval)
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 66b99a2..b5c0ca1 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -235,11 +235,11 @@ static int check_tty_count(struct tty_struct *tty, const char *routine)
struct list_head *p;
int count = 0;

- file_list_lock();
- list_for_each(p, &tty->tty_files) {
+ file_list_lock(&tty->tty_files);
+ list_for_each(p, &tty->tty_files.list) {
count++;
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_SLAVE &&
tty->link && tty->link->count)
@@ -554,9 +554,9 @@ static void do_tty_hangup(struct work_struct *work)
spin_unlock(&redirect_lock);

check_tty_count(tty, "do_tty_hangup");
- file_list_lock();
+ file_list_lock(&tty->tty_files);
/* This breaks for file handles being sent over AF_UNIX sockets ? */
- list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
+ list_for_each_entry(filp, &tty->tty_files.list, f_u.fu_list) {
if (filp->f_op->write == redirected_tty_write)
cons_filp = filp;
if (filp->f_op->write != tty_write)
@@ -565,7 +565,7 @@ static void do_tty_hangup(struct work_struct *work)
tty_fasync(-1, filp, 0); /* can't block */
filp->f_op = &hung_up_tty_fops;
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
/*
* FIXME! What are the locking issues here? This may me overdoing
* things... This question is especially important now that we've
@@ -1467,10 +1467,6 @@ static void release_one_tty(struct kref *kref)
tty_driver_kref_put(driver);
module_put(driver->owner);

- file_list_lock();
- list_del_init(&tty->tty_files);
- file_list_unlock();
-
free_tty_struct(tty);
}

@@ -1678,7 +1674,7 @@ void tty_release_dev(struct file *filp)
* - do_tty_hangup no longer sees this file descriptor as
* something that needs to be handled for hangups.
*/
- file_kill(filp);
+ file_list_del(filp, &tty->tty_files);
filp->private_data = NULL;

/*
@@ -1836,7 +1832,7 @@ got_driver:
return PTR_ERR(tty);

filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_list_add(filp, &tty->tty_files);
check_tty_count(tty, "tty_open");
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_MASTER)
@@ -2779,7 +2775,7 @@ void initialize_tty_struct(struct tty_struct *tty,
mutex_init(&tty->echo_lock);
spin_lock_init(&tty->read_lock);
spin_lock_init(&tty->ctrl_lock);
- INIT_LIST_HEAD(&tty->tty_files);
+ init_file_list(&tty->tty_files);
INIT_WORK(&tty->SAK_work, do_SAK_work);

tty->driver = driver;
diff --git a/fs/file_table.c b/fs/file_table.c
index 334ce39..978f267 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -22,6 +22,7 @@
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
+#include <linux/writeback.h>

#include <asm/atomic.h>

@@ -30,9 +31,6 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
};

-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
-
/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;

@@ -285,7 +283,8 @@ void __fput(struct file *file)
cdev_put(inode->i_cdev);
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- file_kill(file);
+ if (!special_file(inode->i_mode))
+ file_list_del(file, &inode->i_files);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
@@ -352,50 +351,57 @@ void put_filp(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
security_file_free(file);
- file_kill(file);
file_free(file);
}
}

-void file_move(struct file *file, struct list_head *list)
+void init_file_list(struct file_list *files)
{
- if (!list)
- return;
- file_list_lock();
- list_move(&file->f_u.fu_list, list);
- file_list_unlock();
+ INIT_LIST_HEAD(&files->list);
+ spin_lock_init(&files->lock);
}

-void file_kill(struct file *file)
+void file_list_add(struct file *file, struct file_list *files)
{
- if (!list_empty(&file->f_u.fu_list)) {
- file_list_lock();
- list_del_init(&file->f_u.fu_list);
- file_list_unlock();
- }
+ file_list_lock(files);
+ list_add(&file->f_u.fu_list, &files->list);
+ file_list_unlock(files);
+}
+EXPORT_SYMBOL(file_list_add);
+
+void file_list_del(struct file *file, struct file_list *files)
+{
+ file_list_lock(files);
+ list_del_init(&file->f_u.fu_list);
+ file_list_unlock(files);
}
+EXPORT_SYMBOL(file_list_del);

int fs_may_remount_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *file;

/* Check that no files are currently opened for writing. */
- file_list_lock();
- list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
- struct inode *inode = file->f_path.dentry->d_inode;
-
- /* File with pending delete? */
- if (inode->i_nlink == 0)
- goto too_bad;
-
- /* Writeable file? */
- if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
- goto too_bad;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ file_list_lock(&inode->i_files);
+ list_for_each_entry(file, &inode->i_files.list, f_u.fu_list) {
+ /* File with pending delete? */
+ if (inode->i_nlink == 0)
+ goto too_bad;
+
+ /* Writeable file? */
+ if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
+ goto too_bad;
+ }
+ file_list_unlock(&inode->i_files);
}
- file_list_unlock();
+ spin_unlock(&inode_lock);
return 1; /* Tis' cool bro. */
too_bad:
- file_list_unlock();
+ file_list_unlock(&inode->i_files);
+ spin_unlock(&inode_lock);
return 0;
}

@@ -408,33 +414,38 @@ too_bad:
*/
void mark_files_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *f;

retry:
- file_list_lock();
- list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
- struct vfsmount *mnt;
- if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
- continue;
- if (!file_count(f))
- continue;
- if (!(f->f_mode & FMODE_WRITE))
- continue;
- f->f_mode &= ~FMODE_WRITE;
- if (file_check_writeable(f) != 0)
- continue;
- file_release_write(f);
- mnt = mntget(f->f_path.mnt);
- file_list_unlock();
- /*
- * This can sleep, so we can't hold
- * the file_list_lock() spinlock.
- */
- mnt_drop_write(mnt);
- mntput(mnt);
- goto retry;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ file_list_lock(&inode->i_files);
+ list_for_each_entry(f, &inode->i_files.list, f_u.fu_list) {
+ struct vfsmount *mnt;
+ if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+ continue;
+ if (!file_count(f))
+ continue;
+ if (!(f->f_mode & FMODE_WRITE))
+ continue;
+ f->f_mode &= ~FMODE_WRITE;
+ if (file_check_writeable(f) != 0)
+ continue;
+ file_release_write(f);
+ mnt = mntget(f->f_path.mnt);
+ file_list_unlock(&inode->i_files);
+ /*
+ * This can sleep, so we can't hold
+ * the file_list_lock() spinlock.
+ */
+ mnt_drop_write(mnt);
+ mntput(mnt);
+ goto retry;
+ }
+ file_list_unlock(&inode->i_files);
}
- file_list_unlock();
+ spin_unlock(&inode_lock);
}

void __init files_init(unsigned long mempages)
diff --git a/fs/inode.c b/fs/inode.c
index 9d26490..9d52d43 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -251,6 +251,7 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->inotify_watches);
mutex_init(&inode->inotify_mutex);
#endif
+ init_file_list(&inode->i_files);
}
EXPORT_SYMBOL(inode_init_once);

diff --git a/fs/open.c b/fs/open.c
index 7200e23..20c3fc0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -828,7 +828,8 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
f->f_path.mnt = mnt;
f->f_pos = 0;
f->f_op = fops_get(inode->i_fop);
- file_move(f, &inode->i_sb->s_files);
+ if (!special_file(inode->i_mode))
+ file_list_add(f, &inode->i_files);

error = security_dentry_open(f, cred);
if (error)
@@ -873,7 +874,8 @@ cleanup_all:
mnt_drop_write(mnt);
}
}
- file_kill(f);
+ if (!special_file(inode->i_mode))
+ file_list_del(f, &inode->i_files);
f->f_path.dentry = NULL;
f->f_path.mnt = NULL;
cleanup_file:
diff --git a/fs/select.c b/fs/select.c
index bd30fe8..99e4145 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -942,7 +942,6 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
}
#endif /* HAVE_SET_RESTORE_SIGMASK */

-#ifdef CONFIG_FILE_HOTPLUG
static int unpoll_file_once(wait_queue_head_t *q, struct file *file)
{
unsigned long flags;
@@ -971,4 +970,3 @@ void unpoll_file(wait_queue_head_t *q, struct file *file)
schedule_timeout_uninterruptible(1);
}
EXPORT_SYMBOL(unpoll_file);
-#endif
diff --git a/fs/super.c b/fs/super.c
index 2ea1586..477aeb4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -65,7 +65,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_more_io);
- INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73242c3..5329fd6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
return mapping->i_mmap_writable != 0;
}

+struct file_list {
+ spinlock_t lock;
+ struct list_head list;
+};
+
/*
* Use sequence counter to get consistent i_size on 32-bit processors.
*/
@@ -764,6 +769,7 @@ struct inode {
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list */
#endif
+ struct file_list i_files;

unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
@@ -934,9 +940,15 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
+
+static inline void file_list_lock(struct file_list *files)
+{
+ spin_lock(&files->lock);
+}
+static inline void file_list_unlock(struct file_list *files)
+{
+ spin_unlock(&files->lock);
+}

#define get_file(x) atomic_long_inc(&(x)->f_count)
#define file_count(x) atomic_long_read(&(x)->f_count)
@@ -1333,7 +1345,6 @@ struct super_block {
struct list_head s_io; /* parked for writeback */
struct list_head s_more_io; /* parked for more writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
- struct list_head s_files;
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */
@@ -2163,8 +2174,9 @@ static inline void insert_inode_hash(struct inode *inode) {
}

extern struct file * get_empty_filp(void);
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void init_file_list(struct file_list *files);
+extern void file_list_add(struct file *f, struct file_list *files);
+extern void file_list_del(struct file *f, struct file_list *files);
#ifdef CONFIG_BLOCK
struct bio;
extern void submit_bio(int, struct bio *);
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fc39db9..7f04a5e 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -250,7 +250,7 @@ struct tty_struct {
struct work_struct hangup_work;
void *disc_data;
void *driver_data;
- struct list_head tty_files;
+ struct file_list tty_files;

#define N_TTY_BUF_SIZE 4096

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2fcad7c..65afe36 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2244,8 +2244,8 @@ static inline void flush_unauthorized_files(const struct cred *cred,

tty = get_current_tty();
if (tty) {
- file_list_lock();
- if (!list_empty(&tty->tty_files)) {
+ file_list_lock(&tty->tty_files);
+ if (!list_empty(&tty->tty_files.list)) {
struct inode *inode;

/* Revalidate access to controlling tty.
@@ -2253,14 +2253,14 @@ static inline void flush_unauthorized_files(const struct cred *cred,
than using file_has_perm, as this particular open
file may belong to another process and we are only
interested in the inode-based check here. */
- file = list_first_entry(&tty->tty_files, struct file, f_u.fu_list);
+ file = list_first_entry(&tty->tty_files.list, struct file, f_u.fu_list);
inode = file->f_path.dentry->d_inode;
if (inode_has_perm(cred, inode,
FILE__READ | FILE__WRITE, NULL)) {
drop_tty = 1;
}
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
tty_kref_put(tty);
}
/* Reset controlling tty. */
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:53:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/read_write.c | 28 +++++++++----
fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
2 files changed, 95 insertions(+), 44 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 718baea..c473d74 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
goto out;
if (!(in_file->f_mode & FMODE_READ))
goto fput_in;
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(in_file))
+ goto fput_in;
retval = -EINVAL;
in_inode = in_file->f_path.dentry->d_inode;
if (!in_inode)
- goto fput_in;
+ goto unlock_in;
if (!in_file->f_op || !in_file->f_op->splice_read)
- goto fput_in;
+ goto unlock_in;
retval = -ESPIPE;
if (!ppos)
ppos = &in_file->f_pos;
else
if (!(in_file->f_mode & FMODE_PREAD))
- goto fput_in;
+ goto unlock_in;
retval = rw_verify_area(READ, in_file, ppos, count);
if (retval < 0)
- goto fput_in;
+ goto unlock_in;
count = retval;

/*
@@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
retval = -EBADF;
out_file = fget_light(out_fd, &fput_needed_out);
if (!out_file)
- goto fput_in;
+ goto unlock_in;
if (!(out_file->f_mode & FMODE_WRITE))
goto fput_out;
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(out_file))
+ goto fput_out;
retval = -EINVAL;
if (!out_file->f_op || !out_file->f_op->sendpage)
- goto fput_out;
+ goto unlock_out;
out_inode = out_file->f_path.dentry->d_inode;
retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
if (retval < 0)
- goto fput_out;
+ goto unlock_out;
count = retval;

if (!max)
@@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
pos = *ppos;
retval = -EINVAL;
if (unlikely(pos < 0))
- goto fput_out;
+ goto unlock_out;
if (unlikely(pos + count > max)) {
retval = -EOVERFLOW;
if (pos >= max)
- goto fput_out;
+ goto unlock_out;
count = max - pos;
}

@@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
if (*ppos > max)
retval = -EOVERFLOW;

+unlock_out:
+ file_hotplug_read_unlock(out_file);
fput_out:
fput_light(out_file, fput_needed_out);
+unlock_in:
+ file_hotplug_read_unlock(in_file);
fput_in:
fput_light(in_file, fput_needed_in);
out:
diff --git a/fs/splice.c b/fs/splice.c
index 666953d..fc6b3a5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,

error = -EBADF;
file = fget_light(fd, &fput);
- if (file) {
- if (file->f_mode & FMODE_WRITE)
- error = vmsplice_to_pipe(file, iov, nr_segs, flags);
- else if (file->f_mode & FMODE_READ)
- error = vmsplice_to_user(file, iov, nr_segs, flags);
+ if (!file)
+ goto out;

- fput_light(file, fput);
- }
+ if (!file_hotplug_read_trylock(file))
+ goto fput_file;

+ if (file->f_mode & FMODE_WRITE)
+ error = vmsplice_to_pipe(file, iov, nr_segs, flags);
+ else if (file->f_mode & FMODE_READ)
+ error = vmsplice_to_user(file, iov, nr_segs, flags);
+
+ file_hotplug_read_unlock(file);
+fput_file:
+ fput_light(file, fput);
+out:
return error;
}

@@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,

error = -EBADF;
in = fget_light(fd_in, &fput_in);
- if (in) {
- if (in->f_mode & FMODE_READ) {
- out = fget_light(fd_out, &fput_out);
- if (out) {
- if (out->f_mode & FMODE_WRITE)
- error = do_splice(in, off_in,
- out, off_out,
- len, flags);
- fput_light(out, fput_out);
- }
- }
+ if (!in)
+ goto out;

- fput_light(in, fput_in);
- }
+ if (!(in->f_mode & FMODE_READ))
+ goto fput_in;
+
+ error = -EIO;
+ if (!file_hotplug_read_trylock(in))
+ goto fput_in;
+
+ error = -EBADF;
+ out = fget_light(fd_out, &fput_out);
+ if (!out)
+ goto unlock_in;
+
+ if (!(out->f_mode & FMODE_WRITE))
+ goto fput_out;
+
+ error = -EIO;
+ if (!file_hotplug_read_trylock(out))
+ goto fput_out;
+
+ error = do_splice(in, off_in, out, off_out, len, flags);

+ file_hotplug_read_unlock(out);
+fput_out:
+ fput_light(out, fput_out);
+unlock_in:
+ file_hotplug_read_unlock(in);
+fput_in:
+ fput_light(in, fput_in);
+
+out:
return error;
}

@@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,

SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
{
- struct file *in;
- int error, fput_in;
+ struct file *in, *out;
+ int error, fput_in, fput_out;

if (unlikely(!len))
return 0;

error = -EBADF;
in = fget_light(fdin, &fput_in);
- if (in) {
- if (in->f_mode & FMODE_READ) {
- int fput_out;
- struct file *out = fget_light(fdout, &fput_out);
-
- if (out) {
- if (out->f_mode & FMODE_WRITE)
- error = do_tee(in, out, len, flags);
- fput_light(out, fput_out);
- }
- }
- fput_light(in, fput_in);
- }
+ if (!in)
+ goto out;
+
+ if (!(in->f_mode & FMODE_READ))
+ goto unlock_in;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(in))
+ goto fput_in;
+
+ error = -EBADF;
+ out = fget_light(fdout, &fput_out);
+ if (!out)
+ goto unlock_in;
+
+ if (!(out->f_mode & FMODE_WRITE))
+ goto fput_out;
+
+ if (!file_hotplug_read_trylock(out))
+ goto fput_out;
+
+ error = do_tee(in, out, len, flags);
+
+ file_hotplug_read_unlock(out);
+fput_out:
+ fput_light(out, fput_out);
+unlock_in:
+ file_hotplug_read_unlock(in);
+fput_in:
+ fput_light(in, fput_in);
+out:
return error;
}
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:54:20

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 10/23] vfs: Teach do_path_lookup to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/namei.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5472ed0..c4c6575 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1049,23 +1049,30 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei
if (!file)
goto out_fail;

+ retval = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto fput_fail;
+
dentry = file->f_path.dentry;

retval = -ENOTDIR;
if (!S_ISDIR(dentry->d_inode->i_mode))
- goto fput_fail;
+ goto unlock_fail;

retval = file_permission(file, MAY_EXEC);
if (retval)
- goto fput_fail;
+ goto unlock_fail;

nd->path = file->f_path;
path_get(&file->f_path);

+ file_hotplug_read_unlock(file);
fput_light(file, fput_needed);
}
return 0;

+unlock_fail:
+ file_hotplug_read_unlock(file);
fput_fail:
fput_light(file, fput_needed);
out_fail:
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:53:34

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

From: Eric W. Biederman <[email protected]>

Introduce the file_hotplug_lock to protect file->f_op, file->private,
file->f_path from revoke operations.

The file_hotplug_lock is used typically as:
error = -EIO;
if (!file_hotplug_read_trylock(file))
goto out;
....
file_hotplug_read_unlock(file);

In 5 subsystems sysfs, proc, and sysctl, tty, and sound we have support for
modifing a file descriptor so that the underlying object can go away.
In looking at the problem of pci hotunplug it appears that we
potentially need that support for all file descriptors except ones
talking to files on filesystems. Even for file descriptors referring
to files, support for file the underlying object going away is
interesting for implementing features like umount -f and sys_revoke.

The implementations in sysfs, proc and sysctl are all very similar and
are composed of several components.
- A reference count to track that the file operations are being used.
- An ability to flag the file as no longer being valid.
- An ability to wait until the file operations are no longer being used.

In addition for a complete solution we need:
- A reliable way the file structures that we need to revoke.
- To wait for but not tamper with ongoing file creation and cleanup.
- A guarantee that all with user space controlled duration are removed.

The file_hotplug_lock has a very unique implementation necessitated by
the need to have no performance impact on existing code. Classic locking
primitives and reference counting cause pipeline stalls, except for rcu
which provides no ability to preventing reading a data structure while
it is being updated.

file_hotplug_lock keeps the overhead extremely low by dedicating a
small amount of space in the task_struct to store the set of files
the task is currently in the process of using.

The revoke algorithm is simple:
- Find a file on the file_list.
If it is dying or being created come back later
* Take a reference to the file, ensuring it does not get freed while the
revoke code accesses it.
* Block out new usages of fields guarded by file_hotplug_lock.
* Kick the underlying implemenation to wake up functions that are potentially
blocked indefinitely.
* Wait until there are no tasks holding file_hotplug_read_lock
* Release the file specific data.
* Drop the file ref count.
- Repeat until the file list is empty.

The implication of this implementation is that all revoked files will
behave exactly the same way, except for policy controlled by flags in
fmode. The expected behaivor of revoked is close succeeds all other
operations return -EIO. Except for the read on ttys this matches the
historical bsd behavior.

Approriate exports are present so modular character devices can
use the file_list

Signed-off-by: Eric W. Biederman <[email protected]>
---
Documentation/filesystems/vfs.txt | 5 +
fs/Kconfig | 4 +
fs/file_table.c | 166 ++++++++++++++++++++++++++++++++++--
fs/open.c | 6 ++
include/linux/fs.h | 25 ++++++-
include/linux/sched.h | 7 ++
6 files changed, 202 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..d220fd5 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -806,6 +806,11 @@ otherwise noted.
splice_read: called by the VFS to splice data from file to a pipe. This
method is used by the splice(2) system call

+ dead: Called by the VFS to notify a file that it has been killed.
+ Typically this is used to wake up poll, read or other blocking
+ file methods, that could be indefinitely waiting for something
+ to happen.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
diff --git a/fs/Kconfig b/fs/Kconfig
index 9f7270f..2fb86b0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -265,4 +265,8 @@ endif
source "fs/nls/Kconfig"
source "fs/dlm/Kconfig"

+config FILE_HOTPLUG
+ bool
+ default n
+
endmenu
diff --git a/fs/file_table.c b/fs/file_table.c
index 978f267..9db3031 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,6 +23,7 @@
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
#include <linux/writeback.h>
+#include <linux/mm.h>

#include <asm/atomic.h>

@@ -201,7 +202,7 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
file->f_path.dentry = dentry;
file->f_path.mnt = mntget(mnt);
file->f_mapping = dentry->d_inode->i_mapping;
- file->f_mode = mode;
+ file->f_mode = mode | FMODE_OPENED;
file->f_op = fop;

/*
@@ -252,17 +253,12 @@ void drop_file_write_access(struct file *file)
}
EXPORT_SYMBOL_GPL(drop_file_write_access);

-/* __fput is called from task context when aio completion releases the last
- * last use of a struct file *. Do not use otherwise.
- */
-void __fput(struct file *file)
+static void frelease(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = dentry->d_inode;

- might_sleep();
-
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
@@ -277,23 +273,38 @@ void __fput(struct file *file)
}
if (file->f_op && file->f_op->release)
file->f_op->release(inode, file);
- security_file_free(file);
ima_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- put_pid(file->f_owner.pid);
if (!special_file(inode->i_mode))
file_list_del(file, &inode->i_files);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
- file_free(file);
+ file->f_mapping = NULL;
+ file->f_op = NULL;
+ file->private_data = NULL;
dput(dentry);
mntput(mnt);
}

+/* __fput is called from task context when aio completion releases the last
+ * last use of a struct file *. Do not use otherwise.
+ */
+void __fput(struct file *file)
+{
+ might_sleep();
+
+ if (likely(!(file->f_mode & FMODE_DEAD)))
+ frelease(file);
+
+ security_file_free(file);
+ put_pid(file->f_owner.pid);
+ file_free(file);
+}
+
struct file *fget(unsigned int fd)
{
struct file *file;
@@ -360,6 +371,7 @@ void init_file_list(struct file_list *files)
INIT_LIST_HEAD(&files->list);
spin_lock_init(&files->lock);
}
+EXPORT_SYMBOL(init_file_list);

void file_list_add(struct file *file, struct file_list *files)
{
@@ -377,6 +389,140 @@ void file_list_del(struct file *file, struct file_list *files)
}
EXPORT_SYMBOL(file_list_del);

+#ifdef CONFIG_FILE_HOTPLUG
+
+static bool file_in_use(struct file *file)
+{
+ struct task_struct *leader, *task;
+ bool in_use = false;
+ int i;
+
+ rcu_read_lock();
+ do_each_thread(leader, task) {
+ for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
+ if (task->file_hotplug_lock[i] == file) {
+ in_use = true;
+ goto found;
+ }
+ }
+ } while_each_thread(leader, task);
+found:
+ rcu_read_unlock();
+ return in_use;
+}
+
+static int revoke_file(struct file *file)
+{
+ /* Must be called with f_count held and FMODE_OPENED set */
+ fmode_t mode;
+
+ if (!(file->f_mode & FMODE_REVOKE))
+ return -EINVAL;
+
+ /*
+ * Tell everyone this file is dead.
+ */
+ spin_lock(&file->f_ep_lock);
+ mode = file->f_mode;
+ file->f_mode |= FMODE_DEAD;
+ spin_unlock(&file->f_ep_lock);
+ if (mode & FMODE_DEAD)
+ return -EIO;
+
+ /*
+ * Notify the file we have killed it.
+ */
+ if (file->f_op->dead)
+ file->f_op->dead(file);
+
+ /*
+ * Wait until there are no more callers in the file operations.
+ */
+ if (file_in_use(file)) {
+ do {
+ schedule_timeout_uninterruptible(1);
+ } while (file_in_use(file));
+ }
+
+ revoke_file_mappings(file);
+ frelease(file);
+
+ return 0;
+}
+
+int revoke_file_list(struct file_list *files)
+{
+ struct file *file;
+ int error = 0;
+ int empty;
+
+restart:
+ file_list_lock(files);
+ list_for_each_entry(file, &files->list, f_u.fu_list) {
+
+ /* Don't touch files that have not yet been fully opened */
+ if (!(file->f_mode & FMODE_OPENED))
+ continue;
+
+ /* Ensure I am looking at the file after it was opened */
+ smp_rmb();
+
+ /* Don't touch files that are in the final stages of being closed. */
+ if (file_count(file) == 0)
+ continue;
+
+ /* Get a reference to the file */
+ if (!atomic_long_inc_not_zero(&file->f_count))
+ continue;
+
+ file_list_unlock(files);
+
+ error = revoke_file(file);
+ fput(file);
+
+ if (unlikely(error))
+ goto out;
+ goto restart;
+ }
+ empty = list_empty(&files->list);
+ file_list_unlock(files);
+ /*
+ * If the file list had files we can't touch sleep a little while
+ * and check again.
+ */
+ if (!empty) {
+ schedule_timeout_uninterruptible(1);
+ goto restart;
+ }
+out:
+ return error;
+}
+EXPORT_SYMBOL(revoke_file_list);
+
+int __lockfunc file_hotplug_read_trylock(struct file *file)
+{
+ fmode_t mode = file->f_mode;
+ int locked = 0;
+ if (!(mode & FMODE_DEAD)) {
+ struct task_struct *tsk = current;
+ int pos = tsk->file_hotplug_lock_depth;
+ if (likely(pos < MAX_FILE_HOTPLUG_LOCK_DEPTH)) {
+ tsk->file_hotplug_lock_depth = pos + 1;
+ tsk->file_hotplug_lock[pos] = file;
+ locked = 1;
+ }
+ }
+ return locked;
+}
+
+void __lockfunc file_hotplug_read_unlock(struct file *file)
+{
+ struct task_struct *tsk = current;
+ tsk->file_hotplug_lock[--(tsk->file_hotplug_lock_depth)] = NULL;
+}
+
+#endif /* CONFIG_FILE_HOTPLUG */
+
int fs_may_remount_ro(struct super_block *sb)
{
struct inode *inode;
diff --git a/fs/open.c b/fs/open.c
index 20c3fc0..d0b2433 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -809,6 +809,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
const struct cred *cred)
{
struct inode *inode;
+ fmode_t opened_fmode;
int error;

f->f_flags = flags;
@@ -857,6 +858,11 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
}
}

+ opened_fmode = f->f_mode | FMODE_OPENED;
+ /* Ensure revoke_file_list sees the opened file */
+ smp_wmb();
+ f->f_mode = opened_fmode;
+
return f;

cleanup_all:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5329fd6..f7f4c46 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,13 @@ struct inodes_stat_t {
*/
#define FMODE_NOCMTIME ((__force fmode_t)2048)

+/* File has successfully been opened */
+#define FMODE_OPENED ((__force fmode_t)4096)
+/* File supports being revoked */
+#define FMODE_REVOKE ((__force fmode_t)8192)
+/* File is dead (has been revoked) */
+#define FMODE_DEAD ((__force fmode_t)16384)
+
/*
* The below are the various read and write types that we support. Some of
* them include behavioral modifiers that send information down to the
@@ -903,6 +910,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
#define FILE_MNT_WRITE_RELEASED 2

struct file {
+ /* file_hotplug_lock f_op, private, f_path, f_mapping */
/*
* fu_list becomes invalid after file_free is called and queued via
* fu_rcuhead for RCU freeing
@@ -935,12 +943,26 @@ struct file {
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
- struct address_space *f_mapping;
+ struct address_space *f_mapping; /* file_hotplug_lock or mmap_sem */
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
};

+#ifdef CONFIG_FILE_HOTPLUG
+extern int file_hotplug_read_trylock(struct file *file);
+extern void file_hotplug_read_unlock(struct file *file);
+extern int revoke_file_list(struct file_list *files);
+#else
+static inline int file_hotplug_read_trylock(struct file *file)
+{
+ return 1;
+}
+static inline void file_hotplug_read_unlock(struct file *file)
+{
+}
+#endif
+
static inline void file_list_lock(struct file_list *files)
{
spin_lock(&files->lock);
@@ -1514,6 +1536,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ void (*dead)(struct file *);
};

struct inode_operations {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..bbf1616 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1302,6 +1302,13 @@ struct task_struct {
struct irqaction *irqaction;
#endif

+/* File hotplug lock */
+#ifdef CONFIG_FILE_HOTPLUG
+#define MAX_FILE_HOTPLUG_LOCK_DEPTH 4U
+ int file_hotplug_lock_depth;
+ struct file *file_hotplug_lock[MAX_FILE_HOTPLUG_LOCK_DEPTH];
+#endif
+
/* Protection of the PI data structures: */
spinlock_t pi_lock;

--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:54:33

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 12/23] vfs: Teach fcntl to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fcntl.c | 28 +++++++++++++++++++---------
1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index cc8e4de..05d8961 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -344,14 +344,19 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
if (!filp)
goto out;

+ err = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
err = security_file_fcntl(filp, cmd, arg);
- if (err) {
- fput(filp);
- return err;
- }
+ if (err)
+ goto out_unlock;

err = do_fcntl(fd, cmd, arg, filp);

+out_unlock:
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
out:
return err;
@@ -369,13 +374,15 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
if (!filp)
goto out;

+ err = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
err = security_file_fcntl(filp, cmd, arg);
- if (err) {
- fput(filp);
- return err;
- }
+ if (err)
+ goto out_unlock;
+
err = -EBADF;
-
switch (cmd) {
case F_GETLK64:
err = fcntl_getlk64(filp, (struct flock64 __user *) arg);
@@ -389,6 +396,9 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
err = do_fcntl(fd, cmd, arg, filp);
break;
}
+out_unlock:
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
out:
return err;
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:55:53

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 09/23] vfs: Teach poll and select to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/select.c | 24 +++++++++++++++++-------
include/linux/poll.h | 1 +
2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 99e4145..fd68da0 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -416,10 +416,15 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
continue;
file = fget_light(i, &fput_needed);
if (file) {
- f_op = file->f_op;
- mask = DEFAULT_POLLMASK;
- if (f_op && f_op->poll)
- mask = (*f_op->poll)(file, retval ? NULL : wait);
+ mask = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(file)) {
+ f_op = file->f_op;
+ mask = DEFAULT_POLLMASK;
+ if (f_op && f_op->poll)
+ mask = (*f_op->poll)(file, retval ? NULL : wait);
+
+ file_hotplug_read_unlock(file);
+ }
fput_light(file, fput_needed);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
@@ -684,9 +689,14 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
file = fget_light(fd, &fput_needed);
mask = POLLNVAL;
if (file != NULL) {
- mask = DEFAULT_POLLMASK;
- if (file->f_op && file->f_op->poll)
- mask = file->f_op->poll(file, pwait);
+ mask = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(file)) {
+ mask = DEFAULT_POLLMASK;
+ if (file->f_op && file->f_op->poll)
+ mask = file->f_op->poll(file, pwait);
+
+ file_hotplug_read_unlock(file);
+ }
/* Mask out unneeded events. */
mask &= pollfd->events | POLLERR | POLLHUP;
fput_light(file, fput_needed);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index d388620..f0512f4 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -22,6 +22,7 @@
#define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

#define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
+#define DEAD_POLLMASK (DEFAULT_POLLMASK | POLLERR)

struct poll_table_struct;

--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:54:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 13/23] vfs: Teach ioctl to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/compat_ioctl.c | 14 ++++++++++----
fs/ioctl.c | 8 +++++++-
2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index b83f6bc..fa654c5 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2796,10 +2796,14 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
if (!filp)
goto out;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
/* RED-PEN how should LSM module know it's handling 32bit? */
error = security_file_ioctl(filp, cmd, arg);
if (error)
- goto out_fput;
+ goto out_unlock;

/*
* To allow the compat_ioctl handlers to be self contained
@@ -2825,7 +2829,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
if (filp->f_op && filp->f_op->compat_ioctl) {
error = filp->f_op->compat_ioctl(filp, cmd, arg);
if (error != -ENOIOCTLCMD)
- goto out_fput;
+ goto out_unlock;
}

if (!filp->f_op ||
@@ -2853,18 +2857,20 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
error = -EINVAL;
}

- goto out_fput;
+ goto out_unlock;

found_handler:
if (t->handler) {
lock_kernel();
error = t->handler(fd, cmd, arg, filp);
unlock_kernel();
- goto out_fput;
+ goto out_unlock;
}

do_ioctl:
error = do_vfs_ioctl(filp, fd, cmd, arg);
+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_fput:
fput_light(filp, fput_needed);
out:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 82d9c42..2dad7ba 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -577,11 +577,17 @@ SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
if (!filp)
goto out;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
error = security_file_ioctl(filp, cmd, arg);
if (error)
- goto out_fput;
+ goto out_unlock;

error = do_vfs_ioctl(filp, fd, cmd, arg);
+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_fput:
fput_light(filp, fput_needed);
out:
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:56:20

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 16/23] vfs: Teach fstatfs, fstatfs64, ftruncate, fchdir, fchmod, fchown to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/open.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 83d6369..354646b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -167,9 +167,14 @@ SYSCALL_DEFINE2(fstatfs, unsigned int, fd, struct statfs __user *, buf)
file = fget(fd);
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
error = vfs_statfs_native(file->f_path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
+ file_hotplug_read_unlock(file);
+out_putf:
fput(file);
out:
return error;
@@ -188,9 +193,14 @@ SYSCALL_DEFINE3(fstatfs64, unsigned int, fd, size_t, sz, struct statfs64 __user
file = fget(fd);
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
error = vfs_statfs64(file->f_path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
+ file_hotplug_read_unlock(file);
+out_putf:
fput(file);
out:
return error;
@@ -309,6 +319,10 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
if (!file)
goto out;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
/* explicitly opened as large or we are on 64-bit box */
if (file->f_flags & O_LARGEFILE)
small = 0;
@@ -317,16 +331,16 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
inode = dentry->d_inode;
error = -EINVAL;
if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
- goto out_putf;
+ goto out_unlock;

error = -EINVAL;
/* Cannot ftruncate over 2^31 bytes without large file support */
if (small && length > MAX_NON_LFS)
- goto out_putf;
+ goto out_unlock;

error = -EPERM;
if (IS_APPEND(inode))
- goto out_putf;
+ goto out_unlock;

error = locks_verify_truncate(inode, file, length);
if (!error)
@@ -334,6 +348,9 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
ATTR_MTIME|ATTR_CTIME);
if (!error)
error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
+
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -560,15 +577,21 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
if (!file)
goto out;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
inode = file->f_path.dentry->d_inode;

error = -ENOTDIR;
if (!S_ISDIR(inode->i_mode))
- goto out_putf;
+ goto out_unlock;

error = inode_permission(inode, MAY_EXEC | MAY_ACCESS);
if (!error)
set_fs_pwd(current->fs, &file->f_path);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -612,6 +635,10 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)
if (!file)
goto out;

+ err = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
dentry = file->f_path.dentry;
inode = dentry->d_inode;

@@ -619,7 +646,7 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)

err = mnt_want_write_file(file);
if (err)
- goto out_putf;
+ goto out_unlock;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -628,6 +655,8 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
mnt_drop_write(file->f_path.mnt);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -766,13 +795,19 @@ SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
if (!file)
goto out;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
+
error = mnt_want_write_file(file);
if (error)
- goto out_fput;
+ goto out_unlock;
dentry = file->f_path.dentry;
audit_inode(NULL, dentry);
error = chown_common(dentry, user, group);
mnt_drop_write(file->f_path.mnt);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_fput:
fput(file);
out:
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:55:18

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 14/23] vfs: Teach flock to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/locks.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index ec3deea..f74794e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1584,9 +1584,13 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
!(filp->f_mode & (FMODE_READ|FMODE_WRITE)))
goto out_putf;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_putf;
+
error = flock_make_lock(filp, &lock, cmd);
if (error)
- goto out_putf;
+ goto out_unlock;
if (can_sleep)
lock->fl_flags |= FL_SLEEP;

@@ -1604,6 +1608,8 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
out_free:
locks_free_lock(lock);

+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_putf:
fput(filp);
out:
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:55:40

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 11/23] mm: Teach mmap to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
mm/mmap.c | 78 +++++++++++++++++++++++++++++++++++++++--------------------
mm/nommu.c | 21 +++++++++++++++-
2 files changed, 71 insertions(+), 28 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6b7b1a9..f13251a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -914,9 +914,13 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
struct mm_struct * mm = current->mm;
struct inode *inode;
unsigned int vm_flags;
- int error;
+ unsigned long retval;
unsigned long reqprot = prot;

+ retval = -EIO;
+ if (file && !file_hotplug_read_trylock(file))
+ goto out;
+
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*
@@ -927,35 +931,40 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
prot |= PROT_EXEC;

+ retval = -EINVAL;
if (!len)
- return -EINVAL;
+ goto out_unlock;

if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);

- error = arch_mmap_check(addr, len, flags);
- if (error)
- return error;
+ retval = arch_mmap_check(addr, len, flags);
+ if (retval)
+ goto out_unlock;

/* Careful about overflows.. */
+ retval = -ENOMEM;
len = PAGE_ALIGN(len);
if (!len || len > TASK_SIZE)
- return -ENOMEM;
+ goto out_unlock;

/* offset overflow? */
+ retval = -EOVERFLOW;
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
- return -EOVERFLOW;
+ goto out_unlock;

/* Too many mappings? */
+ retval = -ENOMEM;
if (mm->map_count > sysctl_max_map_count)
- return -ENOMEM;
+ goto out_unlock;

/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
addr = get_unmapped_area(file, addr, len, pgoff, flags);
+ retval = addr;
if (addr & ~PAGE_MASK)
- return addr;
+ goto out_unlock;

/* Do simple checking here so the lower-level routines won't have
* to. we assume access permissions have been handled by the open
@@ -965,8 +974,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED) {
+ retval = -EPERM;
if (!can_do_mlock())
- return -EPERM;
+ goto out_unlock;
vm_flags |= VM_LOCKED;
}

@@ -977,8 +987,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
locked += mm->locked_vm;
lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
lock_limit >>= PAGE_SHIFT;
+ retval = -EAGAIN;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- return -EAGAIN;
+ goto out_unlock;
}

inode = file ? file->f_path.dentry->d_inode : NULL;
@@ -986,21 +997,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (file) {
switch (flags & MAP_TYPE) {
case MAP_SHARED:
+ retval = -EACCES;
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
- return -EACCES;
+ goto out_unlock;

/*
* Make sure we don't allow writing to an append-only
* file..
*/
+ retval = -EACCES;
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
- return -EACCES;
+ goto out_unlock;

/*
* Make sure there are no mandatory locks on the file.
*/
+ retval = -EAGAIN;
if (locks_verify_locked(inode))
- return -EAGAIN;
+ goto out_unlock;

vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
@@ -1008,20 +1022,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,

/* fall through */
case MAP_PRIVATE:
+ retval = -EACCES;
if (!(file->f_mode & FMODE_READ))
- return -EACCES;
+ goto out_unlock;
if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+ retval = -EPERM;
if (vm_flags & VM_EXEC)
- return -EPERM;
+ goto out_unlock;
vm_flags &= ~VM_MAYEXEC;
}

+ retval = -ENODEV;
if (!file->f_op || !file->f_op->mmap)
- return -ENODEV;
+ goto out_unlock;
break;

default:
- return -EINVAL;
+ retval = -EINVAL;
+ goto out_unlock;
}
} else {
switch (flags & MAP_TYPE) {
@@ -1039,18 +1057,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
pgoff = addr >> PAGE_SHIFT;
break;
default:
- return -EINVAL;
+ retval = -EINVAL;
+ goto out_unlock;
}
}

- error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
- if (error)
- return error;
- error = ima_file_mmap(file, prot);
- if (error)
- return error;
+ retval = security_file_mmap(file, reqprot, prot, flags, addr, 0);
+ if (retval)
+ goto out_unlock;
+ retval = ima_file_mmap(file, prot);
+ if (retval)
+ goto out_unlock;
+ retval = mmap_region(file, addr, len, flags, vm_flags, pgoff);

- return mmap_region(file, addr, len, flags, vm_flags, pgoff);
+out_unlock:
+ if (file)
+ file_hotplug_read_unlock(file);
+out:
+ return retval;
}
EXPORT_SYMBOL(do_mmap_pgoff);

diff --git a/mm/nommu.c b/mm/nommu.c
index b571ef7..08038b7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1165,7 +1165,7 @@ enomem:
/*
* handle mapping creation for uClinux
*/
-unsigned long do_mmap_pgoff(struct file *file,
+static unsigned long __do_mmap_pgoff(struct file *file,
unsigned long addr,
unsigned long len,
unsigned long prot,
@@ -1402,6 +1402,25 @@ error_getting_region:
show_free_areas();
return -ENOMEM;
}
+
+unsigned long do_mmap_pgoff(struct file *file,
+ unsigned long addr,
+ unsigned long len,
+ unsigned long prot,
+ unsigned long flags,
+ unsigned long pgoff)
+{
+ unsigned long result = -EIO;
+ if (file && !file_hotplug_read_trylock(file))
+ goto out;
+
+ result = __do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+
+ if (file)
+ file_hotplug_read_unlock(file);
+out:
+ return result;
+}
EXPORT_SYMBOL(do_mmap_pgoff);

/*
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:56:35

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 17/23] proc: Teach /proc/<pid>/fd to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

I have taken the opportunity to modify proc_fd_info to have
a single exit point.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fb45615..ee4cdc2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,6 +1626,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
struct files_struct *files = NULL;
struct file *file;
int fd = proc_fd(inode);
+ int retval = -ENOENT;

if (task) {
files = get_files_struct(task);
@@ -1639,24 +1640,26 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
spin_lock(&files->file_lock);
file = fcheck_files(files, fd);
if (file) {
- if (path) {
- *path = file->f_path;
- path_get(&file->f_path);
+ retval = -EIO;
+ if (file_hotplug_read_trylock(file)) {
+ retval = 0;
+ if (path) {
+ *path = file->f_path;
+ path_get(&file->f_path);
+ }
+ if (info)
+ snprintf(info, PROC_FDINFO_MAX,
+ "pos:\t%lli\n"
+ "flags:\t0%o\n",
+ (long long) file->f_pos,
+ file->f_flags);
+ file_hotplug_read_unlock(file);
}
- if (info)
- snprintf(info, PROC_FDINFO_MAX,
- "pos:\t%lli\n"
- "flags:\t0%o\n",
- (long long) file->f_pos,
- file->f_flags);
- spin_unlock(&files->file_lock);
- put_files_struct(files);
- return 0;
}
spin_unlock(&files->file_lock);
put_files_struct(files);
}
- return -ENOENT;
+ return retval;
}

static int proc_fd_link(struct inode *inode, struct path *path)
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:57:45

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
1 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a89f370..eabb167 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -627,8 +627,13 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
struct epitem *epi, *tmp;

list_for_each_entry_safe(epi, tmp, head, rdllink) {
- if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
- epi->event.events)
+ int events = DEAD_POLLMASK;
+
+ if (file_hotplug_read_trylock(epi->ffd.file)) {
+ events = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+ file_hotplug_read_unlock(epi->ffd.file);
+ }
+ if (events & epi->event.events)
return POLLIN | POLLRDNORM;
else {
/*
@@ -1060,8 +1065,12 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,

list_del_init(&epi->rdllink);

- revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
- epi->event.events;
+ revents = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(epi->ffd.file)) {
+ revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+ file_hotplug_read_unlock(epi->ffd.file);
+ }
+ revents &= epi->event.events;

/*
* If the event mask intersect the caller-requested one,
@@ -1248,10 +1257,17 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
if (!tfile)
goto error_fput;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto error_tgt_fput;
+
+ if (!file_hotplug_read_trylock(tfile))
+ goto error_file_unlock;
+
/* The target file descriptor must support poll */
error = -EPERM;
if (!tfile->f_op || !tfile->f_op->poll)
- goto error_tgt_fput;
+ goto error_tgt_unlock;

/*
* We have to check that the file structure underneath the file descriptor
@@ -1260,7 +1276,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
*/
error = -EINVAL;
if (file == tfile || !is_file_epoll(file))
- goto error_tgt_fput;
+ goto error_tgt_unlock;

/*
* At this point it is safe to assume that the "private_data" contains
@@ -1302,6 +1318,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
}
mutex_unlock(&ep->mtx);

+error_tgt_unlock:
+ file_hotplug_read_unlock(tfile);
+error_file_unlock:
+ file_hotplug_read_unlock(file);
error_tgt_fput:
fput(tfile);
error_fput:
@@ -1338,13 +1358,16 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
if (!file)
goto error_return;

+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto error_fput;
/*
* We have to check that the file structure underneath the fd
* the user passed to us _is_ an eventpoll file.
*/
error = -EINVAL;
if (!is_file_epoll(file))
- goto error_fput;
+ goto error_unlock;

/*
* At this point it is safe to assume that the "private_data" contains
@@ -1355,6 +1378,8 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
/* Time to fish for events ... */
error = ep_poll(ep, events, maxevents, timeout);

+error_unlock:
+ file_hotplug_read_unlock(file);
error_fput:
fput(file);
error_return:
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:57:00

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 21/23] vfs: Teach fsync to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/sync.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index e9d56f6..ac6da60 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,6 +197,9 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
* don't have a struct file available. Damn nfsd..
*/
if (file) {
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
mapping = file->f_mapping;
fop = file->f_op;
} else {
@@ -206,7 +209,7 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)

if (!fop || !fop->fsync) {
ret = -EINVAL;
- goto out;
+ goto out_unlock;
}

ret = filemap_fdatawrite(mapping);
@@ -223,6 +226,10 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
err = filemap_fdatawait(mapping);
if (!ret)
ret = err;
+
+out_unlock:
+ if (file)
+ file_hotplug_read_unlock(file);
out:
return ret;
}
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:56:47

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 22/23] vfs: Teach fadvice to file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
mm/fadvise.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 54a0f80..d7f1fba 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -38,6 +38,11 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
if (!file)
return -EBADF;

+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
+
+ ret = 0;
if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
ret = -ESPIPE;
goto out;
@@ -123,6 +128,8 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
ret = -EINVAL;
}
out:
+ file_hotplug_read_unlock(file);
+out_fput:
fput(file);
return ret;
}
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:57:58

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 19/23] eventpoll: Fix comment

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/eventpoll.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index eabb167..d42071d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -62,7 +62,7 @@
* This mutex is acquired by ep_free() during the epoll file
* cleanup path and it is also acquired by eventpoll_release_file()
* if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call toepoll_ctl(EPOLL_CTL_DEL).
+ * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
* It is possible to drop the "ep->mtx" and to use the global
* mutex "epmutex" (together with "ep->lock") to have it working,
* but having "ep->mtx" will make the interface more scalable.
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:57:23

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 20/23] vfs: Teach aio to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/aio.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 76da125..eceb215 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1362,13 +1362,20 @@ static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
+ struct address_space *mapping;
+ struct inode *inode;
ssize_t (*rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
ssize_t ret = 0;
unsigned short opcode;

+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ mapping = file->f_mapping;
+ inode = mapping->host;
+
if ((iocb->ki_opcode == IOCB_CMD_PREADV) ||
(iocb->ki_opcode == IOCB_CMD_PREAD)) {
rw_op = file->f_op->aio_read;
@@ -1379,8 +1386,9 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
}

/* This matches the pread()/pwrite() logic */
+ ret = -EINVAL;
if (iocb->ki_pos < 0)
- return -EINVAL;
+ goto out_unlock;

do {
ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
@@ -1407,26 +1415,37 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;

+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}

static ssize_t aio_fdsync(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
+ ssize_t ret = -EIO;

- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 1);
+ if (file_hotplug_read_trylock(file)) {
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 1);
+ file_hotplug_read_unlock(file);
+ }
return ret;
}

static ssize_t aio_fsync(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
+ ssize_t ret = -EIO;

- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 0);
+ if (file_hotplug_read_trylock(file)) {
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 0);
+ file_hotplug_read_unlock(file);
+ }
return ret;
}

@@ -1469,7 +1488,11 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
static ssize_t aio_setup_iocb(struct kiocb *kiocb)
{
struct file *file = kiocb->ki_filp;
- ssize_t ret = 0;
+ ssize_t ret;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;

switch (kiocb->ki_opcode) {
case IOCB_CMD_PREAD:
@@ -1551,10 +1574,12 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb)
ret = -EINVAL;
}

- if (!kiocb->ki_retry)
- return ret;
+ if (kiocb->ki_retry)
+ ret = 0;

- return 0;
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}

/*
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:57:35

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
mm/filemap.c | 25 ++++++++++++++++---------
1 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..5016aa5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count)

ret = -EBADF;
file = fget(fd);
- if (file) {
- if (file->f_mode & FMODE_READ) {
- struct address_space *mapping = file->f_mapping;
- pgoff_t start = offset >> PAGE_CACHE_SHIFT;
- pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
- unsigned long len = end - start + 1;
- ret = do_readahead(mapping, file, start, len);
- }
- fput(file);
+ if (!file)
+ goto out;
+
+ if (!(file->f_mode & FMODE_READ))
+ goto out_fput;
+
+ if (file_hotplug_read_trylock(file)) {
+ struct address_space *mapping = file->f_mapping;
+ pgoff_t start = offset >> PAGE_CACHE_SHIFT;
+ pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+ unsigned long len = end - start + 1;
+ ret = do_readahead(mapping, file, start, len);
+ file_hotplug_read_unlock(file);
}
+out_fput:
+ fput(file);
+out:
return ret;
}
#ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
--
1.6.3.1.54.g99dd.dirty

2009-06-01 21:58:22

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 15/23] vfs: Teach fallocate, and filp_close to use file_hotplug_lock

From: Eric W. Biederman <[email protected]>

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/open.c | 22 +++++++++++++++++-----
1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index d0b2433..83d6369 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -398,19 +398,22 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
goto out;
if (!(file->f_mode & FMODE_WRITE))
goto out_fput;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
/*
* Revalidate the write permissions, in case security policy has
* changed since the files were opened.
*/
ret = security_file_permission(file, MAY_WRITE);
if (ret)
- goto out_fput;
+ goto out_unlock;

inode = file->f_path.dentry->d_inode;

ret = -ESPIPE;
if (S_ISFIFO(inode->i_mode))
- goto out_fput;
+ goto out_unlock;

ret = -ENODEV;
/*
@@ -418,18 +421,20 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
* for directories or not.
*/
if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
- goto out_fput;
+ goto out_unlock;

ret = -EFBIG;
/* Check for wrap through zero too */
if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
- goto out_fput;
+ goto out_unlock;

if (inode->i_op->fallocate)
ret = inode->i_op->fallocate(inode, mode, offset, len);
else
ret = -EOPNOTSUPP;

+out_unlock:
+ file_hotplug_read_unlock(file);
out_fput:
fput(file);
out:
@@ -1101,18 +1106,25 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, int, mode)
*/
int filp_close(struct file *filp, fl_owner_t id)
{
- int retval = 0;
+ int retval;

if (!file_count(filp)) {
printk(KERN_ERR "VFS: Close: file count is 0\n");
return 0;
}

+ retval = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
+ retval = 0;
if (filp->f_op && filp->f_op->flush)
retval = filp->f_op->flush(filp, id);

dnotify_flush(filp, id);
locks_remove_posix(filp, id);
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
return retval;
}
--
1.6.3.1.54.g99dd.dirty

2009-06-01 22:27:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 01/23] mm: Introduce revoke_file_mappings.

On Mon, 1 Jun 2009 14:50:26 -0700
"Eric W. Biederman" <[email protected]> wrote:

> +static void revoke_vma(struct vm_area_struct *vma)

This looks odd.

> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + unsigned long start_addr, end_addr, size;
> + struct mm_struct *mm;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;

We take a copy of start_addr/end_addr (and this end_addr value is never used)

> + /* Switch out the locks so I can maninuplate this under the mm sem.
> + * Needed so I can call vm_ops->close.
> + */
> + mm = vma->vm_mm;
> + atomic_inc(&mm->mm_users);
> + spin_unlock(&mapping->i_mmap_lock);
> +
> + /* Block page faults and other code modifying the mm. */
> + down_write(&mm->mmap_sem);
> +
> + /* Lookup a vma for my file address */
> + vma = find_vma(mm, start_addr);

Then we look up a vma. Is there reason to believe that this will
differ from the incoming arg which we just overwrote? Maybe the code
is attempting to handle racing concurrent mmap/munmap activity? If so,
what are the implications of this?

I _think_ that what the function is attempting to do is "unmap the vma
which covers the address at vma->start_addr". If so, why not just pass
it that virtual address?

Anyway, it's all a bit obscure and I do think that the semantics and
behaviour should be carefully explained in a comment, no?

> + if (vma->vm_file != file)
> + goto out;

This strengthens the theory that some sort of race-management is
happening here.

> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> + size = end_addr - start_addr;
> +
> + /* Unlock the pages */
> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
> + mm->locked_vm -= vma_pages(vma);
> + vma->vm_flags &= ~VM_LOCKED;
> + }
> +
> + /* Unmap the vma */
> + zap_page_range(vma, start_addr, size, NULL);
> +
> + /* Unlink the vma from the file */
> + unlink_file_vma(vma);
> +
> + /* Close the vma */
> + if (vma->vm_ops && vma->vm_ops->close)
> + vma->vm_ops->close(vma);
> + fput(vma->vm_file);
> + vma->vm_file = NULL;
> + if (vma->vm_flags & VM_EXECUTABLE)
> + removed_exe_file_vma(vma->vm_mm);
> +
> + /* Repurpose the vma */
> + vma->vm_private_data = NULL;
> + vma->vm_ops = &revoked_vm_ops;
> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
> +out:
> + up_write(&mm->mmap_sem);
> + spin_lock(&mapping->i_mmap_lock);
> +}

Also, I'm not a bit fan of the practice of overwriting the value of a
formal argument, especially in a function which is this large and
complex. It makes the code harder to follow, because the one variable
holds two conceptually different things within the span of the same
function. And it adds risk that someone will will later access a field
of *vma and it will be the wrong vma. Worse, the bug is only exposed
under exeedingly rare conditions.

So.. Use a new local, please.

2009-06-02 00:12:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 01/23] mm: Introduce revoke_file_mappings.

Andrew Morton <[email protected]> writes:

> On Mon, 1 Jun 2009 14:50:26 -0700
> "Eric W. Biederman" <[email protected]> wrote:
>
>> +static void revoke_vma(struct vm_area_struct *vma)
>
> This looks odd.
>
>> +{
>> + struct file *file = vma->vm_file;
>> + struct address_space *mapping = file->f_mapping;
>> + unsigned long start_addr, end_addr, size;
>> + struct mm_struct *mm;
>> +
>> + start_addr = vma->vm_start;
>> + end_addr = vma->vm_end;
>
> We take a copy of start_addr/end_addr (and this end_addr value is never used)
A foolish consistency.

>> + /* Switch out the locks so I can maninuplate this under the mm sem.
>> + * Needed so I can call vm_ops->close.
>> + */
>> + mm = vma->vm_mm;
>> + atomic_inc(&mm->mm_users);
>> + spin_unlock(&mapping->i_mmap_lock);
>> +
>> + /* Block page faults and other code modifying the mm. */
>> + down_write(&mm->mmap_sem);
>> +
>> + /* Lookup a vma for my file address */
>> + vma = find_vma(mm, start_addr);
>
> Then we look up a vma. Is there reason to believe that this will
> differ from the incoming arg which we just overwrote? Maybe the code
> is attempting to handle racing concurrent mmap/munmap activity? If so,
> what are the implications of this?

Yes it is. The file based index is only safe while we hold the i_mmap_lock.
The manipulation that needs to happen under the mmap_sem.

So I drop all of the locks and restart. And use the time honored
kernel practice of relooking up the thing I am going to manipulate.
As long as it is for the same file I don't care.

> I _think_ that what the function is attempting to do is "unmap the vma
> which covers the address at vma->start_addr". If so, why not just pass
> it that virtual address?

Actually it is unmapping a vma for the file I am revoking. I hand it
one and then it does an address space jig.

> Anyway, it's all a bit obscure and I do think that the semantics and
> behaviour should be carefully explained in a comment, no?
>
>> + if (vma->vm_file != file)
>> + goto out;
>
> This strengthens the theory that some sort of race-management is
> happening here.

Totally. I dropped all of my locks so I am having to restart in
a different locking context.

>> + start_addr = vma->vm_start;
>> + end_addr = vma->vm_end;
>> + size = end_addr - start_addr;
>> +
>> + /* Unlock the pages */
>> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
>> + mm->locked_vm -= vma_pages(vma);
>> + vma->vm_flags &= ~VM_LOCKED;
>> + }
>> +
>> + /* Unmap the vma */
>> + zap_page_range(vma, start_addr, size, NULL);
>> +
>> + /* Unlink the vma from the file */
>> + unlink_file_vma(vma);
>> +
>> + /* Close the vma */
>> + if (vma->vm_ops && vma->vm_ops->close)
>> + vma->vm_ops->close(vma);
>> + fput(vma->vm_file);
>> + vma->vm_file = NULL;
>> + if (vma->vm_flags & VM_EXECUTABLE)
>> + removed_exe_file_vma(vma->vm_mm);
>> +
>> + /* Repurpose the vma */
>> + vma->vm_private_data = NULL;
>> + vma->vm_ops = &revoked_vm_ops;
>> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
>> +out:
>> + up_write(&mm->mmap_sem);
>> + spin_lock(&mapping->i_mmap_lock);
>> +}
>
> Also, I'm not a bit fan of the practice of overwriting the value of a
> formal argument, especially in a function which is this large and
> complex. It makes the code harder to follow, because the one variable
> holds two conceptually different things within the span of the same
> function. And it adds risk that someone will will later access a field
> of *vma and it will be the wrong vma. Worse, the bug is only exposed
> under exeedingly rare conditions.
>
> So.. Use a new local, please.

We can never legitimately have more than one vma manipulated in this function.
As for the rest. I guess I just assumed that the reader of the code would
have a basic understanding of the locking rules for those data structures.

Certainly the worst thing I suffer from is being close to the code,
and not realizing which pieces are not obvious to a naive observer.

Eric

2009-06-02 05:16:55

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Hi Eric,

On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
<[email protected]> wrote:
> +#ifdef CONFIG_FILE_HOTPLUG
> +
> +static bool file_in_use(struct file *file)
> +{
> + ? ? ? struct task_struct *leader, *task;
> + ? ? ? bool in_use = false;
> + ? ? ? int i;
> +
> + ? ? ? rcu_read_lock();
> + ? ? ? do_each_thread(leader, task) {
> + ? ? ? ? ? ? ? for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
> + ? ? ? ? ? ? ? ? ? ? ? if (task->file_hotplug_lock[i] == file) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? in_use = true;
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto found;
> + ? ? ? ? ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? }
> + ? ? ? } while_each_thread(leader, task);
> +found:
> + ? ? ? rcu_read_unlock();
> + ? ? ? return in_use;
> +}

This seems rather heavy-weight. If we're going to use this
infrastructure for forced unmount, I think this will be a problem.

Can't we two this in two stages: (1) mark a bit that forces
file_hotplug_read_trylock to always fail and (2) block until the last
remaining in-kernel file_hotplug_read_unlock() has executed?

Pekka

2009-06-02 06:52:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Pekka Enberg <[email protected]> writes:

> Hi Eric,
>
> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
> <[email protected]> wrote:
>> +#ifdef CONFIG_FILE_HOTPLUG
>> +
>> +static bool file_in_use(struct file *file)
>> +{
>> +       struct task_struct *leader, *task;
>> +       bool in_use = false;
>> +       int i;
>> +
>> +       rcu_read_lock();
>> +       do_each_thread(leader, task) {
>> +               for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>> +                       if (task->file_hotplug_lock[i] == file) {
>> +                               in_use = true;
>> +                               goto found;
>> +                       }
>> +               }
>> +       } while_each_thread(leader, task);
>> +found:
>> +       rcu_read_unlock();
>> +       return in_use;
>> +}
>
> This seems rather heavy-weight. If we're going to use this
> infrastructure for forced unmount, I think this will be a problem.

> Can't we two this in two stages: (1) mark a bit that forces
> file_hotplug_read_trylock to always fail and (2) block until the last
> remaining in-kernel file_hotplug_read_unlock() has executed?

Yes there is room for more optimization in the slow path.
I haven't noticed being a problem yet so I figured I would start
with stupid and simple.

I can easily see two passes. The first setting the flag an calling
f_op->dead. The second some kind of consolidate walk through the task
list, allowing checking on multiple files at once.

I'm not ready to consider anything that will add cost to the fast
path in the file descriptors though.

Eric

2009-06-02 07:06:53

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 03/23] vfs: Generalize the file_list

On Mon, Jun 01, 2009 at 02:50:28PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> In the implementation of revoke it is desirable to find all of the
> files we want to operation on. Currently tty's and mark_files_ro
> use the file_list for this, and this patch generalizes the file
> list so it can be used more efficiently.
>
> This patch starts by introducing struct file_list making the file
> list a first class object. file_list_lock and file_list_unlock
> are modified to take this object, making it clear which file_list
> we intended to lock.
>
> file_move is transformed into file_list_add taking a file_list and not
> allowing the movement of one file to another. __dentry_open
> is modified to support this by only adding normal files in open,
> special files have always been ignored when walking the file_list.
> __dentry_open skipping special files allows __ptmx_open and __tty_open
> to safely call file_add as they are adding the file to the file_list
> for the first time.
>
> file_kill has been renamed file_list_del to make it clear what it is
> doing and to keep from confusing it with a more revoke like operation.
>
> put_filp has been modified to not take file_list_del as we are never
> on a file_list when put_filp is called.
>
> fs_may_remount_ro and mark_files_ro have been modified to walk the
> inode list to find all of the inodes and then to walk the file list
> on those inodes. It can be a slightly longer walk as we frequently
> cache inodes that we do not have open but the overall complexity
> should be about the same,

Well not really. I have a couple of orders of magnitude more cached
inodes than open files here.


> these are slow path functions, and it
> gives us much greater flexibility overall.

Define flexibility. Walking the sb's file list and checking for
equality with the inode in question gives the same functionality,
just different performance profile.


> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
> return mapping->i_mmap_writable != 0;
> }
>
> +struct file_list {
> + spinlock_t lock;
> + struct list_head list;
> +};
> +
> /*
> * Use sequence counter to get consistent i_size on 32-bit processors.
> */
> @@ -764,6 +769,7 @@ struct inode {
> struct list_head inotify_watches; /* watches on this inode */
> struct mutex inotify_mutex; /* protects the watches list */
> #endif
> + struct file_list i_files;
>
> unsigned long i_state;
> unsigned long dirtied_when; /* jiffies of first dirtying */
> @@ -934,9 +940,15 @@ struct file {
> unsigned long f_mnt_write_state;
> #endif
> };
> -extern spinlock_t files_lock;
> -#define file_list_lock() spin_lock(&files_lock);
> -#define file_list_unlock() spin_unlock(&files_lock);
> +
> +static inline void file_list_lock(struct file_list *files)
> +{
> + spin_lock(&files->lock);
> +}
> +static inline void file_list_unlock(struct file_list *files)
> +{
> + spin_unlock(&files->lock);
> +}

I don't really like this. It's just a list head. Get rid of
all these wrappers and crap I'd say. In fact, starting with my
patch to unexport files_lock and remove these wrappers would
be reasonable, wouldn't it?

Increasing the size of the struct inode by 24 bytes hurts.
Even when you decrapify it and can reuse i_lock or something,
then it is still 16 bytes on 64-bit.

I haven't looked through all the patches... but this is to
speed up a slowpath operation, isn't it? Or does revoke
need to be especially performant?

So this patch is purely a perofrmance improvement? Then I think
it needs to be justified with numbers and the downsides (bloating
struct inode in particulra) to be changelogged.

2009-06-02 07:08:22

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Hi Eric,

On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <[email protected]> wrote:
> Pekka Enberg <[email protected]> writes:
>
>> Hi Eric,
>>
>> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
>> <[email protected]> wrote:
>>> +#ifdef CONFIG_FILE_HOTPLUG
>>> +
>>> +static bool file_in_use(struct file *file)
>>> +{
>>> + ? ? ? struct task_struct *leader, *task;
>>> + ? ? ? bool in_use = false;
>>> + ? ? ? int i;
>>> +
>>> + ? ? ? rcu_read_lock();
>>> + ? ? ? do_each_thread(leader, task) {
>>> + ? ? ? ? ? ? ? for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>>> + ? ? ? ? ? ? ? ? ? ? ? if (task->file_hotplug_lock[i] == file) {
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? in_use = true;
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto found;
>>> + ? ? ? ? ? ? ? ? ? ? ? }
>>> + ? ? ? ? ? ? ? }
>>> + ? ? ? } while_each_thread(leader, task);
>>> +found:
>>> + ? ? ? rcu_read_unlock();
>>> + ? ? ? return in_use;
>>> +}
>>
>> This seems rather heavy-weight. If we're going to use this
>> infrastructure for forced unmount, I think this will be a problem.
>
>> Can't we two this in two stages: (1) mark a bit that forces
>> file_hotplug_read_trylock to always fail and (2) block until the last
>> remaining in-kernel file_hotplug_read_unlock() has executed?
>
> Yes there is room for more optimization in the slow path.
> I haven't noticed being a problem yet so I figured I would start
> with stupid and simple.

Yup, just wanted to point it out.

On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <[email protected]> wrote:
> I can easily see two passes. ?The first setting the flag an calling
> f_op->dead. ?The second some kind of consolidate walk through the task
> list, allowing checking on multiple files at once.
>
> I'm not ready to consider anything that will add cost to the fast
> path in the file descriptors though.

Makes sense.

Pekka

2009-06-02 07:14:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

On Mon, Jun 01, 2009 at 02:50:29PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> Introduce the file_hotplug_lock to protect file->f_op, file->private,
> file->f_path from revoke operations.
>
> The file_hotplug_lock is used typically as:
> error = -EIO;
> if (!file_hotplug_read_trylock(file))
> goto out;
> ....
> file_hotplug_read_unlock(file);

Why is it called hotplug? Does it have anything to do with hardware?
Because every concurrently changed software data structure in the
kernel can be "hot"-modified, right?

Wouldn't file_revoke_lock be more appropriate?


> In addition for a complete solution we need:
> - A reliable way the file structures that we need to revoke.
> - To wait for but not tamper with ongoing file creation and cleanup.
> - A guarantee that all with user space controlled duration are removed.
>
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code. Classic locking

Well, it isn't no performance impact. Function calls, branches, icache
and dcache...

2009-06-02 16:57:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

On Mon, 1 Jun 2009, Eric W. Biederman wrote:

> From: Eric W. Biederman <[email protected]>
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
> 1 files changed, 32 insertions(+), 7 deletions(-)

This patchset gives me the willies for the amount of changes and possible
impact on many subsystems.
Without having looked at the details, are you aware that epoll does not
act like poll/select, and fds are not automatically removed (as in,
dequeued from the poll wait queue) in any foreseeable amount of time after
a POLLERR is received?
As far as the usespace API goes, they have the right to remain there.



- Davide

2009-06-02 17:07:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file



On Tue, 2 Jun 2009, Nick Piggin wrote:
>
> Why is it called hotplug? Does it have anything to do with hardware?
> Because every concurrently changed software data structure in the
> kernel can be "hot"-modified, right?
>
> Wouldn't file_revoke_lock be more appropriate?

I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
"plug" and "unplug".

Linus

2009-06-02 20:53:02

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Linus Torvalds <[email protected]> writes:

> On Tue, 2 Jun 2009, Nick Piggin wrote:
>>
>> Why is it called hotplug? Does it have anything to do with hardware?
>> Because every concurrently changed software data structure in the
>> kernel can be "hot"-modified, right?
>>
>> Wouldn't file_revoke_lock be more appropriate?
>
> I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
> "plug" and "unplug".

I guess this shows my bias in triggering this code path from pci
hotunplug. Instead of with some system call.

I'm not married to the name. I wanted file_lock but that is already
used, and I did call the method revoke.

The only place where hotplug gives a useful hint is that it makes it
clear we really are disconnecting the file descriptor from what lies
below it. We can't do some weird thing like keep the underlying object.
Because the underlying object is gone.

Eric

2009-06-02 21:24:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

Davide Libenzi <[email protected]> writes:

> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>
>> From: Eric W. Biederman <[email protected]>
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>> 1 files changed, 32 insertions(+), 7 deletions(-)
>
> This patchset gives me the willies for the amount of changes and possible
> impact on many subsystems.

It both is and is not that bad. It is the cost of adding a lock.
For the VFS except for nfsd the I have touched everything that needs to be
touched.

Other subsystems that open read/write close files should be able to use
existing vfs helpers so they don't need to know about the new locking
explicitly.

Actually taking advantage of this infrastructure in a subsystem is
comparatively easy. It took me about an hour to get uio using it.
That part is not deep by any means and is opt in.

> Without having looked at the details, are you aware that epoll does not
> act like poll/select, and fds are not automatically removed (as in,
> dequeued from the poll wait queue) in any foreseeable amount of time after
> a POLLERR is received?

Yes I am aware of how epoll acts differently.

> As far as the usespace API goes, they have the right to remain there.

I absolutely agree.

Currently I have the code acting like close() with respect to epoll and
just having the file descriptor vanish at the end of the revoke. While
we the revoke is in progress you get an EIO.

The file descriptor is not freed by a revoke operation so you can happily
hang unto it as long as you want.

I thought of doing something more uniform to user space. But I observed
that the existing epoll punts on the case of a file descriptor being closed
and locking to go from a file to the other epoll datastructures is pretty
horrid I said forget it and used the existing close behaviour.

Eric

2009-06-02 21:58:56

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> Davide Libenzi <[email protected]> writes:
>
> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> >
> >> From: Eric W. Biederman <[email protected]>
> >>
> >> Signed-off-by: Eric W. Biederman <[email protected]>
> >> ---
> >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
> >> 1 files changed, 32 insertions(+), 7 deletions(-)
> >
> > This patchset gives me the willies for the amount of changes and possible
> > impact on many subsystems.
>
> It both is and is not that bad. It is the cost of adding a lock.

We both know that it is not only the cost of a lock, but also the
sprinkling over a pretty vast amount of subsystems, of another layer of
code.



> I thought of doing something more uniform to user space. But I observed
> that the existing epoll punts on the case of a file descriptor being closed
> and locking to go from a file to the other epoll datastructures is pretty
> horrid I said forget it and used the existing close behaviour.

Well, you cannot rely on the caller to tidy up the epoll fd by issuing an
epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave
lingering crap around. You cannot even hold a reference of the file, since
otherwise the epoll hooking will have to trigger not only at ->release()
time, but at every close, where you'll have to figure out if this is the
last real userspace reference or not. Plus all the issues related to
holding permanent extra references to userspace files.
And since a file can be added in many epoll devices, you need to
unregister it from all of them (hence the other datastructures lookup).
Better this, on the slow path, with locks acquired only in the epoll usage
case, than some other thing and on the fast path, for every file.



- Davide

2009-06-02 22:51:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

Davide Libenzi <[email protected]> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> Davide Libenzi <[email protected]> writes:
>>
>> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> >
>> >> From: Eric W. Biederman <[email protected]>
>> >>
>> >> Signed-off-by: Eric W. Biederman <[email protected]>
>> >> ---
>> >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>> >> 1 files changed, 32 insertions(+), 7 deletions(-)
>> >
>> > This patchset gives me the willies for the amount of changes and possible
>> > impact on many subsystems.
>>
>> It both is and is not that bad. It is the cost of adding a lock.
>
> We both know that it is not only the cost of a lock, but also the
> sprinkling over a pretty vast amount of subsystems, of another layer of
> code.

I am not clear what problem you have.

Is it the sprinkling the code that takes and removes the lock? Just
the VFS needs to be involved with that. It is a slightly larger
surface area than doing the work inside the file operations as we
sometimes call the same method from 3-4 different places but it is
definitely a bounded problem.

Is it putting in the handful lines per subsystem to actually use this
functionality? At that level something generic that is maintained
outside of the subsystem is better than the mess we have with 4-5
different implementations in the subsystems that need it, each having
a different assortment of bugs.

>> I thought of doing something more uniform to user space. But I observed
>> that the existing epoll punts on the case of a file descriptor being closed
>> and locking to go from a file to the other epoll datastructures is pretty
>> horrid I said forget it and used the existing close behaviour.
>
> Well, you cannot rely on the caller to tidy up the epoll fd by issuing an
> epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave
> lingering crap around. You cannot even hold a reference of the file, since
> otherwise the epoll hooking will have to trigger not only at ->release()
> time, but at every close, where you'll have to figure out if this is the
> last real userspace reference or not. Plus all the issues related to
> holding permanent extra references to userspace files.
> And since a file can be added in many epoll devices, you need to
> unregister it from all of them (hence the other datastructures lookup).
> Better this, on the slow path, with locks acquired only in the epoll usage
> case, than some other thing and on the fast path, for every file.

Sure, and that is largely and I am preserving those semantics.

Eric

2009-06-02 22:56:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Nick Piggin <[email protected]> writes:

>> In addition for a complete solution we need:
>> - A reliable way the file structures that we need to revoke.
>> - To wait for but not tamper with ongoing file creation and cleanup.
>> - A guarantee that all with user space controlled duration are removed.
>>
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code. Classic locking
>
> Well, it isn't no performance impact. Function calls, branches, icache
> and dcache...

Practically none.

Everything I could measure was in the noise. It is cheaper than any serializing
locking primitive. I ran both lmbench and did some microbenchmark testing.
So I know on the fast path the overhead is minimal. Certainly less than what
we are doing in sysfs and proc today.

Eric

2009-06-03 06:37:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

On Tue, Jun 02, 2009 at 01:52:46PM -0700, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Tue, 2 Jun 2009, Nick Piggin wrote:
> >>
> >> Why is it called hotplug? Does it have anything to do with hardware?
> >> Because every concurrently changed software data structure in the
> >> kernel can be "hot"-modified, right?
> >>
> >> Wouldn't file_revoke_lock be more appropriate?
> >
> > I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
> > "plug" and "unplug".
>
> I guess this shows my bias in triggering this code path from pci
> hotunplug. Instead of with some system call.
>
> I'm not married to the name. I wanted file_lock but that is already
> used, and I did call the method revoke.

Definitely it is not going to be called hotplug in the generic
vfs layer :)


> The only place where hotplug gives a useful hint is that it makes it
> clear we really are disconnecting the file descriptor from what lies
> below it.

Isn't that hotUNplug?

But anyway hot plug/unplug is a purely hardware concept. Revoke
for "unplug", please, including naming of patches, changelogs,
and locks etc.


> We can't do some weird thing like keep the underlying object.
> Because the underlying object is gone.

2009-06-03 06:38:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

On Tue, Jun 02, 2009 at 03:56:02PM -0700, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
>
> >> In addition for a complete solution we need:
> >> - A reliable way the file structures that we need to revoke.
> >> - To wait for but not tamper with ongoing file creation and cleanup.
> >> - A guarantee that all with user space controlled duration are removed.
> >>
> >> The file_hotplug_lock has a very unique implementation necessitated by
> >> the need to have no performance impact on existing code. Classic locking
> >
> > Well, it isn't no performance impact. Function calls, branches, icache
> > and dcache...
>
> Practically none.

OK that's different from none. There is obviously overhead.

2009-06-03 15:03:56

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> I am not clear what problem you have.
>
> Is it the sprinkling the code that takes and removes the lock? Just
> the VFS needs to be involved with that. It is a slightly larger
> surface area than doing the work inside the file operations as we
> sometimes call the same method from 3-4 different places but it is
> definitely a bounded problem.
>
> Is it putting in the handful lines per subsystem to actually use this
> functionality? At that level something generic that is maintained
> outside of the subsystem is better than the mess we have with 4-5
> different implementations in the subsystems that need it, each having
> a different assortment of bugs.

Come on, only in the open fast path, there are at least two spin
lock/unlock and two atomic ops. Without even starting to count all the
extra branches and software added.
Is this stuff *really* needed, or we can faitly happily live w/out?


- Davide

2009-06-03 20:54:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

Davide Libenzi <[email protected]> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> I am not clear what problem you have.
>>
>> Is it the sprinkling the code that takes and removes the lock? Just
>> the VFS needs to be involved with that. It is a slightly larger
>> surface area than doing the work inside the file operations as we
>> sometimes call the same method from 3-4 different places but it is
>> definitely a bounded problem.
>>
>> Is it putting in the handful lines per subsystem to actually use this
>> functionality? At that level something generic that is maintained
>> outside of the subsystem is better than the mess we have with 4-5
>> different implementations in the subsystems that need it, each having
>> a different assortment of bugs.
>
> Come on, only in the open fast path, there are at least two spin
> lock/unlock and two atomic ops. Without even starting to count all the
> extra branches and software added.
> Is this stuff *really* needed, or we can faitly happily live w/out?

????

What code are you talking about?

To the open path a few memory writes and a smp_wmb. No atomics and no
spin lock/unlocks.

Are you complaining because I retain the file_list?

Eric

2009-06-03 23:25:43

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock

On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> mm/filemap.c | 25 ++++++++++++++++---------
> 1 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 379ff0b..5016aa5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count)
>
> ret = -EBADF;
> file = fget(fd);
> - if (file) {
> - if (file->f_mode & FMODE_READ) {
> - struct address_space *mapping = file->f_mapping;
> - pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> - pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
> - unsigned long len = end - start + 1;
> - ret = do_readahead(mapping, file, start, len);
> - }
> - fput(file);
> + if (!file)
> + goto out;
> +
> + if (!(file->f_mode & FMODE_READ))
> + goto out_fput;
> +

To be consistent with others, don't you want to do
ret = -EIO;
here ?
> + if (file_hotplug_read_trylock(file)) {
> + struct address_space *mapping = file->f_mapping;
> + pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> + pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
> + unsigned long len = end - start + 1;
> + ret = do_readahead(mapping, file, start, len);
> + file_hotplug_read_unlock(file);
> }
> +out_fput:
> + fput(file);
> +out:
> return ret;
> }
> #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS

Thanks,
Badari

2009-06-03 23:39:46

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock

On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/read_write.c | 28 +++++++++----
> fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
> 2 files changed, 95 insertions(+), 44 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 718baea..c473d74 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> goto out;
> if (!(in_file->f_mode & FMODE_READ))
> goto fput_in;
> + retval = -EIO;
> + if (!file_hotplug_read_trylock(in_file))
> + goto fput_in;
> retval = -EINVAL;
> in_inode = in_file->f_path.dentry->d_inode;
> if (!in_inode)
> - goto fput_in;
> + goto unlock_in;
> if (!in_file->f_op || !in_file->f_op->splice_read)
> - goto fput_in;
> + goto unlock_in;
> retval = -ESPIPE;
> if (!ppos)
> ppos = &in_file->f_pos;
> else
> if (!(in_file->f_mode & FMODE_PREAD))
> - goto fput_in;
> + goto unlock_in;
> retval = rw_verify_area(READ, in_file, ppos, count);
> if (retval < 0)
> - goto fput_in;
> + goto unlock_in;
> count = retval;
>
> /*
> @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> retval = -EBADF;
> out_file = fget_light(out_fd, &fput_needed_out);
> if (!out_file)
> - goto fput_in;
> + goto unlock_in;
> if (!(out_file->f_mode & FMODE_WRITE))
> goto fput_out;
> + retval = -EIO;
> + if (!file_hotplug_read_trylock(out_file))
> + goto fput_out;
> retval = -EINVAL;
> if (!out_file->f_op || !out_file->f_op->sendpage)
> - goto fput_out;
> + goto unlock_out;
> out_inode = out_file->f_path.dentry->d_inode;
> retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
> if (retval < 0)
> - goto fput_out;
> + goto unlock_out;
> count = retval;
>
> if (!max)
> @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> pos = *ppos;
> retval = -EINVAL;
> if (unlikely(pos < 0))
> - goto fput_out;
> + goto unlock_out;
> if (unlikely(pos + count > max)) {
> retval = -EOVERFLOW;
> if (pos >= max)
> - goto fput_out;
> + goto unlock_out;
> count = max - pos;
> }
>
> @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> if (*ppos > max)
> retval = -EOVERFLOW;
>
> +unlock_out:
> + file_hotplug_read_unlock(out_file);
> fput_out:
> fput_light(out_file, fput_needed_out);
> +unlock_in:
> + file_hotplug_read_unlock(in_file);
> fput_in:
> fput_light(in_file, fput_needed_in);
> out:
> diff --git a/fs/splice.c b/fs/splice.c
> index 666953d..fc6b3a5 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
>
> error = -EBADF;
> file = fget_light(fd, &fput);
> - if (file) {
> - if (file->f_mode & FMODE_WRITE)
> - error = vmsplice_to_pipe(file, iov, nr_segs, flags);
> - else if (file->f_mode & FMODE_READ)
> - error = vmsplice_to_user(file, iov, nr_segs, flags);
> + if (!file)
> + goto out;
>
> - fput_light(file, fput);
> - }
> + if (!file_hotplug_read_trylock(file))
> + goto fput_file;
>
> + if (file->f_mode & FMODE_WRITE)
> + error = vmsplice_to_pipe(file, iov, nr_segs, flags);
> + else if (file->f_mode & FMODE_READ)
> + error = vmsplice_to_user(file, iov, nr_segs, flags);
> +
> + file_hotplug_read_unlock(file);
> +fput_file:
> + fput_light(file, fput);
> +out:
> return error;
> }
>
> @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
>
> error = -EBADF;
> in = fget_light(fd_in, &fput_in);
> - if (in) {
> - if (in->f_mode & FMODE_READ) {
> - out = fget_light(fd_out, &fput_out);
> - if (out) {
> - if (out->f_mode & FMODE_WRITE)
> - error = do_splice(in, off_in,
> - out, off_out,
> - len, flags);
> - fput_light(out, fput_out);
> - }
> - }
> + if (!in)
> + goto out;
>
> - fput_light(in, fput_in);
> - }
> + if (!(in->f_mode & FMODE_READ))
> + goto fput_in;
> +
> + error = -EIO;
> + if (!file_hotplug_read_trylock(in))
> + goto fput_in;
> +
> + error = -EBADF;
> + out = fget_light(fd_out, &fput_out);
> + if (!out)
> + goto unlock_in;
> +
> + if (!(out->f_mode & FMODE_WRITE))
> + goto fput_out;
> +
> + error = -EIO;
> + if (!file_hotplug_read_trylock(out))
> + goto fput_out;
> +
> + error = do_splice(in, off_in, out, off_out, len, flags);
>
> + file_hotplug_read_unlock(out);
> +fput_out:
> + fput_light(out, fput_out);
> +unlock_in:
> + file_hotplug_read_unlock(in);
> +fput_in:
> + fput_light(in, fput_in);
> +
> +out:
> return error;
> }
>
> @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,
>
> SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
> {
> - struct file *in;
> - int error, fput_in;
> + struct file *in, *out;
> + int error, fput_in, fput_out;
>
> if (unlikely(!len))
> return 0;
>
> error = -EBADF;
> in = fget_light(fdin, &fput_in);
> - if (in) {
> - if (in->f_mode & FMODE_READ) {
> - int fput_out;
> - struct file *out = fget_light(fdout, &fput_out);
> -
> - if (out) {
> - if (out->f_mode & FMODE_WRITE)
> - error = do_tee(in, out, len, flags);
> - fput_light(out, fput_out);
> - }
> - }
> - fput_light(in, fput_in);
> - }
> + if (!in)
> + goto out;
> +
> + if (!(in->f_mode & FMODE_READ))
> + goto unlock_in; <<<<<<<

Shouldn't this be
goto fput_in;

? btw, its confusing to have labels and variables with same name:
fput_in and fput_out. You may want to rename labels ?

>
> + error = -EIO;
> + if (!file_hotplug_read_trylock(in))
> + goto fput_in;
> +
> + error = -EBADF;
> + out = fget_light(fdout, &fput_out);
> + if (!out)
> + goto unlock_in;
> +
> + if (!(out->f_mode & FMODE_WRITE))
> + goto fput_out;
> +
> + if (!file_hotplug_read_trylock(out))
> + goto fput_out;
> +
> + error = do_tee(in, out, len, flags);
> +
> + file_hotplug_read_unlock(out);
> +fput_out:
> + fput_light(out, fput_out);
> +unlock_in:
> + file_hotplug_read_unlock(in);
> +fput_in:
> + fput_light(in, fput_in);
> +out:
> return error;
> }

Thanks,
Badari

2009-06-04 00:56:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

On Wed, 3 Jun 2009, Eric W. Biederman wrote:

> What code are you talking about?
>
> To the open path a few memory writes and a smp_wmb. No atomics and no
> spin lock/unlocks.
>
> Are you complaining because I retain the file_list?

Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin
lock/unlock couple present in __dentry_open() (same sort of the release
path)?
And that's only like 5% of the code touched by the new special handling of
the file operations structure (basically, every f_op access ends up being
wrapped by two atomic ops and other extra code).
The question, that I'd like to reiterate is, is this stuff really needed?
Anyway, my complaint ends here and I'll let others evaluate if merging
this patchset is worth the cost.


- Davide

2009-06-04 01:42:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock

Davide Libenzi <[email protected]> writes:

> On Wed, 3 Jun 2009, Eric W. Biederman wrote:
>
>> What code are you talking about?
>>
>> To the open path a few memory writes and a smp_wmb. No atomics and no
>> spin lock/unlocks.
>>
>> Are you complaining because I retain the file_list?
>
> Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin
> lock/unlock couple present in __dentry_open() (same sort of the release
> path)?

You might be remembering v1. In v2 I have operations like file_hotplug_read_trylock
that implement a lock but use an rcu like algorithm. So there are no atomic
operations involved with their associated pipeline stalls. Over my previous
version this made a reasonable performance benefit.

> And that's only like 5% of the code touched by the new special handling of
> the file operations structure (basically, every f_op access ends up being
> wrapped by two atomic ops and other extra code).

Yes there is a single extra wrapping of every file in the syscall path. So
we know that someone is using it.

> The question, that I'd like to reiterate is, is this stuff really needed?
> Anyway, my complaint ends here and I'll let others evaluate if merging
> this patchset is worth the cost.

Sure. My apologies for not answering that question earlier.

My perspective is that every subsystem that winds up supporting hotplug
hardware winds up rolling it's own version of something like this,
and they each have a different set of bugs.

So one generic version is definitely worth implementing.

Similarly there is a case for a generic revoke facility in the kernel.
Alan at least has made the case that there are certain security problems
that can not be solved in userspace without revoke.

>From an implementation point of view doing the generic implementation at
the vfs level has significant benefits.

The extra locking appears reasonable from a code maintenance and
comprehensibility point of view. A real pain to find all of the entry
points into the vfs, and get other code to use the right vfs helpers
they should always have been using but I am volunteering to do that
work.

The practical question I see is are the performance overheads of my
primitives low enough that I do not cause performance regressions
on anyone's fast path. As far as I have been able to measure is that
the performance overhead is low enough, because I have been able to
avoid the use of atomics and have been able to use fairly small code
with predictable branches. Which is why I pressed you to be certain
I understood where you are coming from.

Eric

2009-06-05 09:03:51

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Hi Eric,

Very interesting work.

On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code. Classic locking
> primitives and reference counting cause pipeline stalls, except for rcu
> which provides no ability to preventing reading a data structure while
> it is being updated.

Well, the simple solution to that is to add another level of indirection:

old:

fdtable -> file

new:

fdtable -> persistent_file -> file

Then it is possible to replace persistent_file->file with a revoked
one under RCU. This has the added advantage that it supports
arbitrary file replacements, not just ones which return EIO.

Another advantage is that dereferencing can normally be done "under
the hood" in fget()/fget_light(). Only code which wants to
permanently store a file pointer (like the SCM_RIGHTS thing) would
need to be aware of the extra complexity.

Would that work, do you think?

Thanks,
Miklos

2009-06-05 19:06:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file

Miklos Szeredi <[email protected]> writes:

> Hi Eric,
>
> Very interesting work.
>
> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code. Classic locking
>> primitives and reference counting cause pipeline stalls, except for rcu
>> which provides no ability to preventing reading a data structure while
>> it is being updated.
>
> Well, the simple solution to that is to add another level of indirection:
>
> old:
>
> fdtable -> file
>
> new:
>
> fdtable -> persistent_file -> file
>
> Then it is possible to replace persistent_file->file with a revoked
> one under RCU. This has the added advantage that it supports
> arbitrary file replacements, not just ones which return EIO.
>
> Another advantage is that dereferencing can normally be done "under
> the hood" in fget()/fget_light(). Only code which wants to
> permanently store a file pointer (like the SCM_RIGHTS thing) would
> need to be aware of the extra complexity.
>
> Would that work, do you think?

Well I went down this path for a little while, and it has some good points.
Unfortunately it appears to be more costly.

fget() and friends are semantically very different my
file_hotplug_read_trylock and unlock. In fact there is very little
overlap. Which means that transparent to the vfs users doesn't
actually work.

We actually have more and less predictable places where we store files.

If there was actually a compelling case for being more general I would
certainly agree that splitting the file structure in two would be a
good deal. As it is that level of flexibility seems to be overkill.

Eric

2009-06-05 19:34:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 03/23] vfs: Generalize the file_list

Nick Piggin <[email protected]> writes:

>> fs_may_remount_ro and mark_files_ro have been modified to walk the
>> inode list to find all of the inodes and then to walk the file list
>> on those inodes. It can be a slightly longer walk as we frequently
>> cache inodes that we do not have open but the overall complexity
>> should be about the same,
>
> Well not really. I have a couple of orders of magnitude more cached
> inodes than open files here.

Good point.

>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
>> return mapping->i_mmap_writable != 0;
>> }
>>
>> +struct file_list {
>> + spinlock_t lock;
>> + struct list_head list;
>> +};
>> +
>> /*
>> * Use sequence counter to get consistent i_size on 32-bit processors.
>> */
>> @@ -764,6 +769,7 @@ struct inode {
>> struct list_head inotify_watches; /* watches on this inode */
>> struct mutex inotify_mutex; /* protects the watches list */
>> #endif
>> + struct file_list i_files;
>>
>> unsigned long i_state;
>> unsigned long dirtied_when; /* jiffies of first dirtying */
>> @@ -934,9 +940,15 @@ struct file {
>> unsigned long f_mnt_write_state;
>> #endif
>> };
>> -extern spinlock_t files_lock;
>> -#define file_list_lock() spin_lock(&files_lock);
>> -#define file_list_unlock() spin_unlock(&files_lock);
>> +
>> +static inline void file_list_lock(struct file_list *files)
>> +{
>> + spin_lock(&files->lock);
>> +}
>> +static inline void file_list_unlock(struct file_list *files)
>> +{
>> + spin_unlock(&files->lock);
>> +}
>
> I don't really like this. It's just a list head. Get rid of
> all these wrappers and crap I'd say. In fact, starting with my
> patch to unexport files_lock and remove these wrappers would
> be reasonable, wouldn't it?

I don't really mind killing the wrappers.

I do mind your patch because it makes the list going through
the tty's something very different. In my view of the world
that is the only use case is what I'm working to move up more
into the vfs layer. So orphaning it seems wrong.

> Increasing the size of the struct inode by 24 bytes hurts.
> Even when you decrapify it and can reuse i_lock or something,
> then it is still 16 bytes on 64-bit.

We can get it even smaller if we make it an hlist. A hlist_head is
only a single pointer. This size growth appears to be one of the
biggest weakness of the code.

> I haven't looked through all the patches... but this is to
> speed up a slowpath operation, isn't it? Or does revoke
> need to be especially performant?

This was more about simplicity rather than performance. The
performance gain is using a per inode lock instead of a global lock.
Which keeps cache lines from bouncing.

> So this patch is purely a perofrmance improvement? Then I think
> it needs to be justified with numbers and the downsides (bloating
> struct inode in particulra) to be changelogged.

Certainly the cost.

One of the things I have discovered since I wrote this patch is the
i_devices list. Which means we don't necessarily need to have heads
in places other than struct inode. A character device driver (aka the
tty code) can walk it's inode list and from each inode walk the file
list. I need to check the locking on that one.

If that simplification works we can move all maintenance of the file
list into the vfs and not need a separate file list concept. I will
take a look.

Eric

2009-06-05 19:37:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock

Badari Pulavarty <[email protected]> writes:

> On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
>> From: Eric W. Biederman <[email protected]>
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> fs/read_write.c | 28 +++++++++----
>> fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
>> 2 files changed, 95 insertions(+), 44 deletions(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 718baea..c473d74 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> goto out;
>> if (!(in_file->f_mode & FMODE_READ))
>> goto fput_in;
>> + retval = -EIO;
>> + if (!file_hotplug_read_trylock(in_file))
>> + goto fput_in;
>> retval = -EINVAL;
>> in_inode = in_file->f_path.dentry->d_inode;
>> if (!in_inode)
>> - goto fput_in;
>> + goto unlock_in;
>> if (!in_file->f_op || !in_file->f_op->splice_read)
>> - goto fput_in;
>> + goto unlock_in;
>> retval = -ESPIPE;
>> if (!ppos)
>> ppos = &in_file->f_pos;
>> else
>> if (!(in_file->f_mode & FMODE_PREAD))
>> - goto fput_in;
>> + goto unlock_in;
>> retval = rw_verify_area(READ, in_file, ppos, count);
>> if (retval < 0)
>> - goto fput_in;
>> + goto unlock_in;
>> count = retval;
>>
>> /*
>> @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> retval = -EBADF;
>> out_file = fget_light(out_fd, &fput_needed_out);
>> if (!out_file)
>> - goto fput_in;
>> + goto unlock_in;
>> if (!(out_file->f_mode & FMODE_WRITE))
>> goto fput_out;
>> + retval = -EIO;
>> + if (!file_hotplug_read_trylock(out_file))
>> + goto fput_out;
>> retval = -EINVAL;
>> if (!out_file->f_op || !out_file->f_op->sendpage)
>> - goto fput_out;
>> + goto unlock_out;
>> out_inode = out_file->f_path.dentry->d_inode;
>> retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
>> if (retval < 0)
>> - goto fput_out;
>> + goto unlock_out;
>> count = retval;
>>
>> if (!max)
>> @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> pos = *ppos;
>> retval = -EINVAL;
>> if (unlikely(pos < 0))
>> - goto fput_out;
>> + goto unlock_out;
>> if (unlikely(pos + count > max)) {
>> retval = -EOVERFLOW;
>> if (pos >= max)
>> - goto fput_out;
>> + goto unlock_out;
>> count = max - pos;
>> }
>>
>> @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> if (*ppos > max)
>> retval = -EOVERFLOW;
>>
>> +unlock_out:
>> + file_hotplug_read_unlock(out_file);
>> fput_out:
>> fput_light(out_file, fput_needed_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in_file);
>> fput_in:
>> fput_light(in_file, fput_needed_in);
>> out:
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 666953d..fc6b3a5 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
>>
>> error = -EBADF;
>> file = fget_light(fd, &fput);
>> - if (file) {
>> - if (file->f_mode & FMODE_WRITE)
>> - error = vmsplice_to_pipe(file, iov, nr_segs, flags);
>> - else if (file->f_mode & FMODE_READ)
>> - error = vmsplice_to_user(file, iov, nr_segs, flags);
>> + if (!file)
>> + goto out;
>>
>> - fput_light(file, fput);
>> - }
>> + if (!file_hotplug_read_trylock(file))
>> + goto fput_file;
>>
>> + if (file->f_mode & FMODE_WRITE)
>> + error = vmsplice_to_pipe(file, iov, nr_segs, flags);
>> + else if (file->f_mode & FMODE_READ)
>> + error = vmsplice_to_user(file, iov, nr_segs, flags);
>> +
>> + file_hotplug_read_unlock(file);
>> +fput_file:
>> + fput_light(file, fput);
>> +out:
>> return error;
>> }
>>
>> @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
>>
>> error = -EBADF;
>> in = fget_light(fd_in, &fput_in);
>> - if (in) {
>> - if (in->f_mode & FMODE_READ) {
>> - out = fget_light(fd_out, &fput_out);
>> - if (out) {
>> - if (out->f_mode & FMODE_WRITE)
>> - error = do_splice(in, off_in,
>> - out, off_out,
>> - len, flags);
>> - fput_light(out, fput_out);
>> - }
>> - }
>> + if (!in)
>> + goto out;
>>
>> - fput_light(in, fput_in);
>> - }
>> + if (!(in->f_mode & FMODE_READ))
>> + goto fput_in;
>> +
>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(in))
>> + goto fput_in;
>> +
>> + error = -EBADF;
>> + out = fget_light(fd_out, &fput_out);
>> + if (!out)
>> + goto unlock_in;
>> +
>> + if (!(out->f_mode & FMODE_WRITE))
>> + goto fput_out;
>> +
>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(out))
>> + goto fput_out;
>> +
>> + error = do_splice(in, off_in, out, off_out, len, flags);
>>
>> + file_hotplug_read_unlock(out);
>> +fput_out:
>> + fput_light(out, fput_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in);
>> +fput_in:
>> + fput_light(in, fput_in);
>> +
>> +out:
>> return error;
>> }
>>
>> @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,
>>
>> SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
>> {
>> - struct file *in;
>> - int error, fput_in;
>> + struct file *in, *out;
>> + int error, fput_in, fput_out;
>>
>> if (unlikely(!len))
>> return 0;
>>
>> error = -EBADF;
>> in = fget_light(fdin, &fput_in);
>> - if (in) {
>> - if (in->f_mode & FMODE_READ) {
>> - int fput_out;
>> - struct file *out = fget_light(fdout, &fput_out);
>> -
>> - if (out) {
>> - if (out->f_mode & FMODE_WRITE)
>> - error = do_tee(in, out, len, flags);
>> - fput_light(out, fput_out);
>> - }
>> - }
>> - fput_light(in, fput_in);
>> - }
>> + if (!in)
>> + goto out;
>> +
>> + if (!(in->f_mode & FMODE_READ))
>> + goto unlock_in; <<<<<<<
>
> Shouldn't this be
> goto fput_in;

Good point. That is a bug.

> ? btw, its confusing to have labels and variables with same name:
> fput_in and fput_out. You may want to rename labels ?

Mayhap. I didn't start that one, although I am clearly spreading
it around here. Do you have a better naming suggestion?

>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(in))
>> + goto fput_in;
>> +
>> + error = -EBADF;
>> + out = fget_light(fdout, &fput_out);
>> + if (!out)
>> + goto unlock_in;
>> +
>> + if (!(out->f_mode & FMODE_WRITE))
>> + goto fput_out;
>> +
>> + if (!file_hotplug_read_trylock(out))
>> + goto fput_out;
>> +
>> + error = do_tee(in, out, len, flags);
>> +
>> + file_hotplug_read_unlock(out);
>> +fput_out:
>> + fput_light(out, fput_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in);
>> +fput_in:
>> + fput_light(in, fput_in);
>> +out:
>> return error;
>> }


Eric

2009-06-06 08:03:59

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote:
>
> I found myself looking at the uio, seeing that it does not support pci
> hot-unplug, and thinking "Great yet another implementation of
> hotunplug logic that needs to be added".
>
> I decided to see what it would take to add a generic implementation of
> the code we have for supporting hot unplugging devices in sysfs, proc,
> sysctl, tty_io, and now almost in the tun driver.
>
> Not long after I touched the tun driver and made it safe to delete the
> network device while still holding it's file descriptor open I someone
> else touch the code adding a different feature and my careful work
> went up in flames. Which brought home another point at the best of it
> this is ultimately complex tricky code that subsystems should not need
> to worry about.
>
> What makes this even more interesting is that in the presence of pci
> hot-unplug it looks like most subsystems and most devices will have to
> deal with the issue one way or another.
>
> This infrastructure could also be used to implement both force
> unmounts and sys_revoke. When I could not think of a better name for
> I have drawn on that and used revoke.

To be honest, the longer I'm looking at it, the less I like the approach...
It really looks as if we'd be much better off with functionality sitting
in a set of library helpers to be used by instances that need this stuff.
Do we really want it for generic case?

Note that "we might someday implement real force-umount" doesn't count;
the same kind of arguments had been given nine years ago in case of AIO
("oh, sure, we'll eventually cover foo_get_block() too - it will all be
a state machine, fully asynchronous; whaddya mean 'it's not feasible'?").
Of course, it was _not_ feasible and had never been implemented.

Frankly, I very much suspect that force-umount is another case like that;
we'll need a *lot* of interesting cooperation from fs for that to work and
to be useful. I'd be delighted to be proven incorrect on that one, so
if you have anything serious in that direction, please share the details.

As for the patchset in the current form... Could you explain what's to prevent
POSIX locks and dnotify entries from outliving a struct file you'd revoked,
seeing that filp_close() will skip killing them in that case.

2009-06-06 08:08:35

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 02/23] vfs: Implement unpoll_file.

On Mon, Jun 01, 2009 at 02:50:27PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> During a revoke operation it is necessary to stop using all state that is managed
> by the underlying file operations implementation. The poll wait queue is one part
> of that state.

Erm... Seeing that drivers and filesystems tend to have fsckloads of
other state of their own, why do we treat that separately?

2009-06-08 09:41:53

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Sat, 6 Jun 2009, Al Viro wrote:
> Frankly, I very much suspect that force-umount is another case like that;
> we'll need a *lot* of interesting cooperation from fs for that to work and
> to be useful. I'd be delighted to be proven incorrect on that one, so
> if you have anything serious in that direction, please share the details.

Umm, not sure why we'd need cooperation from the fs. Simply wait for
the operation to exit the filesystem or driver. If it's a blocking
operation, send a signal to interrupt it.

Sure, filesystems and drivers have lots of state, but we don't need to
care about that, just like we don't need to care about it for
remounting read-only.

Thanks,
Miklos

2009-06-08 10:25:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

Miklos Szeredi wrote:
> On Sat, 6 Jun 2009, Al Viro wrote:
> > Frankly, I very much suspect that force-umount is another case like that;
> > we'll need a *lot* of interesting cooperation from fs for that to work and
> > to be useful. I'd be delighted to be proven incorrect on that one, so
> > if you have anything serious in that direction, please share the details.
>
> Umm, not sure why we'd need cooperation from the fs. Simply wait for
> the operation to exit the filesystem or driver. If it's a blocking
> operation, send a signal to interrupt it.

We could even include the internal signal in TASK_KILLABLE, so it
interrupts otherwise uninterruptible operations.

-- Jamie

2009-06-08 16:29:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote:
> On Sat, 6 Jun 2009, Al Viro wrote:
> > Frankly, I very much suspect that force-umount is another case like that;
> > we'll need a *lot* of interesting cooperation from fs for that to work and
> > to be useful. I'd be delighted to be proven incorrect on that one, so
> > if you have anything serious in that direction, please share the details.
>
> Umm, not sure why we'd need cooperation from the fs. Simply wait for
> the operation to exit the filesystem or driver. If it's a blocking
> operation, send a signal to interrupt it.

And making sure that operations *are* interruptible (and that we can cope
with $BIGNUM new failure exits correctly) does not qualify as cooperation?

2009-06-08 16:45:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, 8 Jun 2009, Al Viro wrote:
> On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote:
> > On Sat, 6 Jun 2009, Al Viro wrote:
> > > Frankly, I very much suspect that force-umount is another case like that;
> > > we'll need a *lot* of interesting cooperation from fs for that to work and
> > > to be useful. I'd be delighted to be proven incorrect on that one, so
> > > if you have anything serious in that direction, please share the details.
> >
> > Umm, not sure why we'd need cooperation from the fs. Simply wait for
> > the operation to exit the filesystem or driver. If it's a blocking
> > operation, send a signal to interrupt it.
>
> And making sure that operations *are* interruptible (and that we can cope
> with $BIGNUM new failure exits correctly) does not qualify as cooperation?

I'm still not getting what the problem is. AFAICS file operations are
either

a) non-interruptible but finish within a short time or
b) may block indefinitely but are interruptible (or at least killable).

Anything else is already problematic, resulting in processes "stuck in
D state".

Can you give a more concrete example about your worries?

Thanks,
Miklos

2009-06-08 17:50:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:

> I'm still not getting what the problem is. AFAICS file operations are
> either
>
> a) non-interruptible but finish within a short time or
> b) may block indefinitely but are interruptible (or at least killable).
>
> Anything else is already problematic, resulting in processes "stuck in
> D state".

Welcome to reality...

* bread() is non-interruptible
* so's copy_from_user()/copy_to_user()
* IO we are stuck upon _might_ be interruptible, but by sending a signal
to some other process

... just for starters. If you sign up for auditing the tree to eliminate
"something's stuck in D state", you are welcome to it. Mind you, you'll
have to audit filesystems for "doesn't check if metadata IO has failed"
first, but that _really_ needs to be done anyway. On the ongoing basis.

Drivers, of course, are even more interesting - looking through foo_ioctl()
instances is a wonderful way to lower pH in stomach, but that's on the
"we want revoke()" side of it.

2009-06-08 18:04:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2



On Mon, 8 Jun 2009, Al Viro wrote:
>
> Welcome to reality...
>
> * bread() is non-interruptible
> * so's copy_from_user()/copy_to_user()
> * IO we are stuck upon _might_ be interruptible, but by sending a signal
> to some other process

We can probably improve on these, though.

Like the copy_to/from_user thing. We might well be able to do that whole
"if it's a fatal signal, return early" thing.

So in the _general_ case - no, we probably can't fix things. But we could
likely at least improve in some common cases if we cared.

Linus

2009-06-08 18:51:25

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, Jun 08, 2009 at 11:01:51AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 8 Jun 2009, Al Viro wrote:
> >
> > Welcome to reality...
> >
> > * bread() is non-interruptible
> > * so's copy_from_user()/copy_to_user()
> > * IO we are stuck upon _might_ be interruptible, but by sending a signal
> > to some other process
>
> We can probably improve on these, though.
>
> Like the copy_to/from_user thing. We might well be able to do that whole
> "if it's a fatal signal, return early" thing.
>
> So in the _general_ case - no, we probably can't fix things. But we could
> likely at least improve in some common cases if we cared.

Sure, even though I'm not at all certain that copy_from_user() is that easy.
We can make locking current->mm in there interruptible, all right, but that's
only a part of the answer - even aside of the allocations, we'd need vma
->fault() interruptible as well, which leads to interruptible instances of
->readpage(), with all the fun _that_ would be.

2009-06-08 19:20:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2



On Mon, 8 Jun 2009, Al Viro wrote:
>
> Sure, even though I'm not at all certain that copy_from_user() is that easy.
> We can make locking current->mm in there interruptible, all right, but that's
> only a part of the answer - even aside of the allocations, we'd need vma
> ->fault() interruptible as well, which leads to interruptible instances of
> ->readpage(), with all the fun _that_ would be.

We already have all that - the NFS people wanted it.

More importantly, you don't actually need to interrupt readpage itself -
you just need to stop _waiting_ on it. So in your fault handler, just stop
waiting, and instead just return FAULT_RETRY or whatever.

Linus

2009-06-09 05:51:13

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, 8 Jun 2009, Al Viro wrote:
> On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:
>
> > I'm still not getting what the problem is. AFAICS file operations are
> > either
> >
> > a) non-interruptible but finish within a short time or
> > b) may block indefinitely but are interruptible (or at least killable).
> >
> > Anything else is already problematic, resulting in processes "stuck in
> > D state".
>
> Welcome to reality...
>
> * bread() is non-interruptible
> * so's copy_from_user()/copy_to_user()

And why should revoke(2) care? Just wait for the damn thing to
finish. Why exactly do these need to be interruptible?

Okay, if we want revoke or umount -f to be instantaneous then all that
needs to be taken care of. But does it *need* to be?

My idea of revoke is something like below:

- make sure no new operations are started on the file
- check state of tasks for ongoing operations, if interruptible send signal
- wait for all pending operations to finish
- kill file

Thanks,
Miklos

2009-06-09 06:23:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

Al Viro <[email protected]> writes:

> On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote:
>>
>> I found myself looking at the uio, seeing that it does not support pci
>> hot-unplug, and thinking "Great yet another implementation of
>> hotunplug logic that needs to be added".
>>
>> I decided to see what it would take to add a generic implementation of
>> the code we have for supporting hot unplugging devices in sysfs, proc,
>> sysctl, tty_io, and now almost in the tun driver.
>>
>> Not long after I touched the tun driver and made it safe to delete the
>> network device while still holding it's file descriptor open I someone
>> else touch the code adding a different feature and my careful work
>> went up in flames. Which brought home another point at the best of it
>> this is ultimately complex tricky code that subsystems should not need
>> to worry about.
>>
>> What makes this even more interesting is that in the presence of pci
>> hot-unplug it looks like most subsystems and most devices will have to
>> deal with the issue one way or another.
>>
>> This infrastructure could also be used to implement both force
>> unmounts and sys_revoke. When I could not think of a better name for
>> I have drawn on that and used revoke.
>
> To be honest, the longer I'm looking at it, the less I like the approach...
> It really looks as if we'd be much better off with functionality sitting
> in a set of library helpers to be used by instances that need this stuff.
> Do we really want it for generic case?

I think so. I do know I have seen enough weird cases actually being
used and not being done correctly we want a clean pattern for handling
the general case that works and is complete.

The problem seems to break up into several pieces.
- unmap support.
- Getting a list of the files that are open for an inode.
- Waking up interruptible sleepers.
- A test to see if we are executing any of the functions in
the file_operations structure. (needed before we can free state)
- Calling frelease and generally releasing of the state held by the
file.

It might be possible to solve the entire problem outside of the vfs

> Note that "we might someday implement real force-umount" doesn't count;
> the same kind of arguments had been given nine years ago in case of AIO
> ("oh, sure, we'll eventually cover foo_get_block() too - it will all be
> a state machine, fully asynchronous; whaddya mean 'it's not feasible'?").
> Of course, it was _not_ feasible and had never been implemented.

> Frankly, I very much suspect that force-umount is another case like that;
> we'll need a *lot* of interesting cooperation from fs for that to work and
> to be useful. I'd be delighted to be proven incorrect on that one, so
> if you have anything serious in that direction, please share the details.

So far nothing but thought experiments, but you have a good point at
least a proof of concept should be done of the various pieces. To flush
out some niggling little detail that messes up the design.

So I hereby sign up for writing a sys_revoke patch, a forced umount patch
and a writing a patch to ext2 to support it. Supporting proc and
sysfs while easy is not really the common case of an nfs exportable block
filesystem so it is not complete.

> As for the patchset in the current form... Could you explain what's to prevent
> POSIX locks and dnotify entries from outliving a struct file you'd revoked,
> seeing that filp_close() will skip killing them in that case.

Good catch that looks like a big fat bug to me. It seems I overlooked
the fact that we actually free things in filp_close.

Given that posix_remove_file calls vfs_lock_file which calls
file->f_op->lock it looks like something really needs to be done here.

dnotify_flush doesn't look to hard to spin a special case for revoke.

I am going to have to spend I while longer studying the rest of the
code in filp_close. I hope I don't need to figure out the various
fl_owner_t values to safely revoke a file, but it looks like I might.

Eric

2009-06-09 06:31:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

Miklos Szeredi <[email protected]> writes:

> On Mon, 8 Jun 2009, Al Viro wrote:
>> On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:
>>
>> > I'm still not getting what the problem is. AFAICS file operations are
>> > either
>> >
>> > a) non-interruptible but finish within a short time or
>> > b) may block indefinitely but are interruptible (or at least killable).
>> >
>> > Anything else is already problematic, resulting in processes "stuck in
>> > D state".
>>
>> Welcome to reality...
>>
>> * bread() is non-interruptible
>> * so's copy_from_user()/copy_to_user()
>
> And why should revoke(2) care? Just wait for the damn thing to
> finish. Why exactly do these need to be interruptible?

Agreed. I expect the data size is going to be a page or less. Which
is at most 64K on some weird architectures. I think that counts as a
short time waiting for disk I/O. Baring thrashing.

> Okay, if we want revoke or umount -f to be instantaneous then all that
> needs to be taken care of. But does it *need* to be?

Good question. I wonder what umount -f needs when we yank out a usb drive.

> My idea of revoke is something like below:
>
> - make sure no new operations are started on the file
> - check state of tasks for ongoing operations, if interruptible send signal

Figuring out who to send a signal to is tricky. Still it should be doable
in the common case.

> - wait for all pending operations to finish
> - kill file

Eric

2009-06-09 06:43:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

Linus Torvalds <[email protected]> writes:

> On Mon, 8 Jun 2009, Al Viro wrote:
>>
>> Sure, even though I'm not at all certain that copy_from_user() is that easy.
>> We can make locking current->mm in there interruptible, all right, but that's
>> only a part of the answer - even aside of the allocations, we'd need vma
>> ->fault() interruptible as well, which leads to interruptible instances of
>> ->readpage(), with all the fun _that_ would be.
>
> We already have all that - the NFS people wanted it.
>
> More importantly, you don't actually need to interrupt readpage itself -
> you just need to stop _waiting_ on it. So in your fault handler, just stop
> waiting, and instead just return FAULT_RETRY or whatever.

That sounds doable. Has that code been merged yet?

I took a quick look and it didn't see anyone breaking out of page fault with a
signal or code to really handle that.

Eric

2009-06-09 10:41:11

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 03/23] vfs: Generalize the file_list

On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
>
> >> +static inline void file_list_unlock(struct file_list *files)
> >> +{
> >> + spin_unlock(&files->lock);
> >> +}
> >
> > I don't really like this. It's just a list head. Get rid of
> > all these wrappers and crap I'd say. In fact, starting with my
> > patch to unexport files_lock and remove these wrappers would
> > be reasonable, wouldn't it?
>
> I don't really mind killing the wrappers.
>
> I do mind your patch because it makes the list going through
> the tty's something very different. In my view of the world
> that is the only use case is what I'm working to move up more
> into the vfs layer. So orphaning it seems wrong.

My patch doesn't orphan it, it just makes the locking more
explicit and that's all so it should be easier to work with.
I just mean start with my patch and you could change things
as needed.


> > Increasing the size of the struct inode by 24 bytes hurts.
> > Even when you decrapify it and can reuse i_lock or something,
> > then it is still 16 bytes on 64-bit.
>
> We can get it even smaller if we make it an hlist. A hlist_head is
> only a single pointer. This size growth appears to be one of the
> biggest weakness of the code.

8 bytes would be a lot better than 24.


> > I haven't looked through all the patches... but this is to
> > speed up a slowpath operation, isn't it? Or does revoke
> > need to be especially performant?
>
> This was more about simplicity rather than performance. The
> performance gain is using a per inode lock instead of a global lock.
> Which keeps cache lines from bouncing.

Yes but we already have such a global lock which has been
OK until now. Granted that some users are running into these
locks, but fine graining them can be considered independently
I think. So using per-sb lists of files and not bloating
struct inode any more could be a less controversial step
for you.


> > So this patch is purely a perofrmance improvement? Then I think
> > it needs to be justified with numbers and the downsides (bloating
> > struct inode in particulra) to be changelogged.
>
> Certainly the cost.
>
> One of the things I have discovered since I wrote this patch is the
> i_devices list. Which means we don't necessarily need to have heads
> in places other than struct inode. A character device driver (aka the
> tty code) can walk it's inode list and from each inode walk the file
> list. I need to check the locking on that one.
>
> If that simplification works we can move all maintenance of the file
> list into the vfs and not need a separate file list concept. I will
> take a look.

Thanks,
Nick

2009-06-09 10:53:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2

On Mon, Jun 08, 2009 at 11:42:53PM -0700, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Mon, 8 Jun 2009, Al Viro wrote:
> >>
> >> Sure, even though I'm not at all certain that copy_from_user() is that easy.
> >> We can make locking current->mm in there interruptible, all right, but that's
> >> only a part of the answer - even aside of the allocations, we'd need vma
> >> ->fault() interruptible as well, which leads to interruptible instances of
> >> ->readpage(), with all the fun _that_ would be.
> >
> > We already have all that - the NFS people wanted it.
> >
> > More importantly, you don't actually need to interrupt readpage itself -
> > you just need to stop _waiting_ on it. So in your fault handler, just stop
> > waiting, and instead just return FAULT_RETRY or whatever.
>
> That sounds doable. Has that code been merged yet?
>
> I took a quick look and it didn't see anyone breaking out of page fault with a
> signal or code to really handle that.

The problem is get_user_pages I think. Now that we have a good number of
fault flags, we can pass down whether the caller is able to be
interrupted or not.

Ben H had some interest in doing this, but I don't know how far he got
with it.

2009-06-09 18:39:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 03/23] vfs: Generalize the file_list

Nick Piggin <[email protected]> writes:

> On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
>> Nick Piggin <[email protected]> writes:
>>
>> >> +static inline void file_list_unlock(struct file_list *files)
>> >> +{
>> >> + spin_unlock(&files->lock);
>> >> +}
>> >
>> > I don't really like this. It's just a list head. Get rid of
>> > all these wrappers and crap I'd say. In fact, starting with my
>> > patch to unexport files_lock and remove these wrappers would
>> > be reasonable, wouldn't it?
>>
>> I don't really mind killing the wrappers.
>>
>> I do mind your patch because it makes the list going through
>> the tty's something very different. In my view of the world
>> that is the only use case is what I'm working to move up more
>> into the vfs layer. So orphaning it seems wrong.
>
> My patch doesn't orphan it, it just makes the locking more
> explicit and that's all so it should be easier to work with.
> I just mean start with my patch and you could change things
> as needed.

As I recall you weren't using the files_lock for the tty layer. I
seem to recall you were still walking through the same list head on
struct file.

Regardless it sure felt like pushing the tty usage out into
some weird special case. My goal is to make it reasonable for
more character drivers to use the list so it isn't an especially
comfortable starting place for me.

>> > Increasing the size of the struct inode by 24 bytes hurts.
>> > Even when you decrapify it and can reuse i_lock or something,
>> > then it is still 16 bytes on 64-bit.
>>
>> We can get it even smaller if we make it an hlist. A hlist_head is
>> only a single pointer. This size growth appears to be one of the
>> biggest weakness of the code.
>
> 8 bytes would be a lot better than 24.

Definitely.

>> > I haven't looked through all the patches... but this is to
>> > speed up a slowpath operation, isn't it? Or does revoke
>> > need to be especially performant?
>>
>> This was more about simplicity rather than performance. The
>> performance gain is using a per inode lock instead of a global lock.
>> Which keeps cache lines from bouncing.
>
> Yes but we already have such a global lock which has been
> OK until now. Granted that some users are running into these
> locks, but fine graining them can be considered independently
> I think. So using per-sb lists of files and not bloating
> struct inode any more could be a less controversial step
> for you.

I will take a look. Certainly doing the work in a couple
of patches seems reasonable. If I can move all of the list
maintenance out of the tty layer. That looks to be the ideal
case.

Eric

2009-06-10 06:05:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 03/23] vfs: Generalize the file_list

On Tue, Jun 09, 2009 at 11:38:59AM -0700, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
>
> > On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
> >> Nick Piggin <[email protected]> writes:
> >>
> >> >> +static inline void file_list_unlock(struct file_list *files)
> >> >> +{
> >> >> + spin_unlock(&files->lock);
> >> >> +}
> >> >
> >> > I don't really like this. It's just a list head. Get rid of
> >> > all these wrappers and crap I'd say. In fact, starting with my
> >> > patch to unexport files_lock and remove these wrappers would
> >> > be reasonable, wouldn't it?
> >>
> >> I don't really mind killing the wrappers.
> >>
> >> I do mind your patch because it makes the list going through
> >> the tty's something very different. In my view of the world
> >> that is the only use case is what I'm working to move up more
> >> into the vfs layer. So orphaning it seems wrong.
> >
> > My patch doesn't orphan it, it just makes the locking more
> > explicit and that's all so it should be easier to work with.
> > I just mean start with my patch and you could change things
> > as needed.
>
> As I recall you weren't using the files_lock for the tty layer. I
> seem to recall you were still walking through the same list head on
> struct file.
>
> Regardless it sure felt like pushing the tty usage out into
> some weird special case. My goal is to make it reasonable for
> more character drivers to use the list so it isn't an especially
> comfortable starting place for me.

I don't see the problem. It made files_lock for filesystems
and uses another lock for tty. Tty is a special case (or
different case) compared with filesystem, and how did it
make it unreasonable for character drivers to use the list?

Mandating the locking and list to be in the inode for
everyone is just bloating things up.


> >> > Increasing the size of the struct inode by 24 bytes hurts.
> >> > Even when you decrapify it and can reuse i_lock or something,
> >> > then it is still 16 bytes on 64-bit.
> >>
> >> We can get it even smaller if we make it an hlist. A hlist_head is
> >> only a single pointer. This size growth appears to be one of the
> >> biggest weakness of the code.
> >
> > 8 bytes would be a lot better than 24.
>
> Definitely.
>
> >> > I haven't looked through all the patches... but this is to
> >> > speed up a slowpath operation, isn't it? Or does revoke
> >> > need to be especially performant?
> >>
> >> This was more about simplicity rather than performance. The
> >> performance gain is using a per inode lock instead of a global lock.
> >> Which keeps cache lines from bouncing.
> >
> > Yes but we already have such a global lock which has been
> > OK until now. Granted that some users are running into these
> > locks, but fine graining them can be considered independently
> > I think. So using per-sb lists of files and not bloating
> > struct inode any more could be a less controversial step
> > for you.
>
> I will take a look. Certainly doing the work in a couple
> of patches seems reasonable. If I can move all of the list
> maintenance out of the tty layer. That looks to be the ideal
> case.

I will wait to see. It will be nice if you have any obvious
standalone fixes or improvements then to post them first or
in front of your patchset: I'd like to make some progress
here too to help my locking patchset.