Hell Linus, Al and everyone,
First off, sorry for this pull request being late in the merge window. Al
had raised a couple of concerns about 2 items in the series below. I
addressed the first issue (the race introduced by Gu's use of mm_populate()),
but he has not provided any further details on how he wants to rework the
anon_inode.c changes (which were sent out months ago but have yet to be
commented on). The bulk of the changes have been sitting in the -next tree
for a few months, with all the issues raised being addressed. Please
consider this pull. Thanks,
-ben
The following changes since commit 47188d39b5deeebf41f87a02af1b3935866364cf:
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2013-07-14 21:47:51 -0700)
are available in the git repository at:
git://git.kvack.org/~bcrl/aio-next.git master
for you to fetch changes up to d9b2c8714aef102dea95544a8cd9372b21af463f:
aio: rcu_read_lock protection for new rcu_dereference calls (2013-09-09 12:29:35 -0400)
----------------------------------------------------------------
Artem Savkov (1):
aio: rcu_read_lock protection for new rcu_dereference calls
Benjamin LaHaise (9):
aio: fix build when migration is disabled
aio: double aio_max_nr in calculations
aio: convert the ioctx list to table lookup v3
aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
aio: table lookup: verify ctx pointer
aio: remove unnecessary debugging from aio_free_ring()
aio: fix rcu sparse warnings introduced by ioctx table lookup patch
aio: fix race in ring buffer page lookup introduced by page migration support
Gu Zheng (2):
fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
fs/aio: Add support to aio ring pages migration
Kent Overstreet (9):
aio: reqs_active -> reqs_available
aio: percpu reqs_available
aio: percpu ioctx refcount
aio: io_cancel() no longer returns the io_event
aio: Don't use ctx->tail unnecessarily
aio: Kill aio_rw_vect_retry()
aio: Kill unneeded kiocb members
aio: Kill ki_users
aio: Kill ki_dtor
Peng Tao (1):
staging/lustre: kiocb->ki_left is removed
drivers/staging/android/logger.c | 2 +-
drivers/staging/lustre/lustre/llite/file.c | 4 +-
drivers/usb/gadget/inode.c | 9 +-
fs/aio.c | 726 ++++++++++++++++++-----------
fs/anon_inodes.c | 66 +++
fs/block_dev.c | 2 +-
fs/nfs/direct.c | 1 -
fs/ocfs2/file.c | 6 +-
fs/read_write.c | 3 -
fs/udf/file.c | 2 +-
include/linux/aio.h | 21 +-
include/linux/anon_inodes.h | 3 +
include/linux/migrate.h | 3 +
include/linux/mm_types.h | 5 +-
kernel/fork.c | 2 +-
mm/migrate.c | 2 +-
mm/page_io.c | 1 -
net/socket.c | 15 +-
18 files changed, 561 insertions(+), 312 deletions(-)
On Fri, Sep 13, 2013 at 12:59:37PM -0400, Benjamin LaHaise wrote:
> Hell Linus, Al and everyone,
>
> First off, sorry for this pull request being late in the merge window. Al
> had raised a couple of concerns about 2 items in the series below. I
> addressed the first issue (the race introduced by Gu's use of mm_populate()),
> but he has not provided any further details on how he wants to rework the
> anon_inode.c changes (which were sent out months ago but have yet to be
> commented on). The bulk of the changes have been sitting in the -next tree
> for a few months, with all the issues raised being addressed. Please
> consider this pull. Thanks,
OK... As for objections against anon_inodes.c stuff, it can be dealt with
after merge. Basically, I don't like using anon_inodes as a dumping ground -
look how little of what that sucker is doing has anything to do with the
code in anon_inodes.c; you override practically everything anyway. It's
just a "filesystems are hard, let's go shopping". Look, declaring an
fs takes about 20 lines. Total. All you really use from anon_inodes.c is
{
struct inode *inode = new_inode_pseudo(s);
if (!inode)
return ERR_PTR(-ENOMEM);
inode->i_ino = get_next_ino();
inode->i_state = I_DIRTY;
inode->i_mode = S_IRUSR | S_IWUSR;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
inode->i_flags |= S_PRIVATE;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
return inode;
}
which can bloody well go into fs/inode.c - it has nothing whatsoever
anon_inode-specific. You end up overriding ->a_ops anyway. Moreover,
your "allocate a file/dentry/inode and give it to me" helper creates
a struct file that needs to be patched up by caller. What's the point
of passing ctx to anon_inode_getfile_private(), then? And the same
will happen for any extra callers that API might grow.
Look, defining an fs is as simple as this:
struct vfsmount *aio_mnt;
static struct dentry *aio_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
static const struct dentry_operations ops = {
.d_dname = simple_dname,
};
return mount_pseudo(fs_type, "aio:", NULL,
&ops, 0x69696969);
}
and in aio_setup() do this
static struct file_system_type aiofs = {
.name = "aio",
.mount = aio_mount,
.kill_sb = kill_anon_super,
};
aio_mnt = kern_mount(&aio_fs);
if (IS_ERR(aio_mnt))
panic("buggered");
That's all the glue you need. Again, the proper solution is to take
fs-independent parts of anon_inode_mkinode() to fs/inode.c (there's a
lot of open-coded variants floating around in the tree, BTW) and do
what anon_inode_getfile_private() is trying to do right in aio.c.
With the patch-up you have to do afterwards folded in. Look at what
it's doing, really.
* allocate an inode, with uid/gid/ino/timestamps set in
usual way. Should be fs/inode.c helper.
* set the rest of it up (size, a_ops, ->mapping->private_data) -
the things you open-code anyway
* d_alloc_pseudo() on constant name ("anon_inode:[aio]")
* d_instantiate()
* mntget()
* alloc_file(), with &aio_ring_fops passed to it
* set file->private_data (unused)
It might make sense to add a helper for steps 3--5 (something along the
lines of alloc_pseudo_file(mnt, name, inode, mode, fops)). Step 6 is
useless, AFAICS.
Note that anon_inodes.c reason to exist was "it's for situations where
all context lives on struct file and we don't need separate inode for
them". Going from that to "it happens to contain a handy function for inode
allocation"...
Hi Al,
On Fri, Sep 13, 2013 at 07:42:04PM +0100, Al Viro wrote:
> OK... As for objections against anon_inodes.c stuff, it can be dealt with
> after merge. Basically, I don't like using anon_inodes as a dumping ground -
> look how little of what that sucker is doing has anything to do with the
> code in anon_inodes.c; you override practically everything anyway. It's
> just a "filesystems are hard, let's go shopping". Look, declaring an
> fs takes about 20 lines. Total. All you really use from anon_inodes.c is
...
> Note that anon_inodes.c reason to exist was "it's for situations where
> all context lives on struct file and we don't need separate inode for
> them". Going from that to "it happens to contain a handy function for inode
> allocation"...
The main reason for re-using anon_inodes.c was more to avoid duplicating
code. In any case, the below reworks things as suggested, and it seems to
work in basic testing (the migrate pages test passes, as well as some basic
operations generating events). Could you please review changes below? If
it looks okay, I'll add it to my next bug fix pull. Credit goes to Al for
having written most of this code in his previous email.
-ben
aio.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 129 insertions(+), 7 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 6b868f0..3acca84 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -36,10 +36,11 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
-#include <linux/anon_inodes.h>
#include <linux/migrate.h>
#include <linux/ramfs.h>
#include <linux/percpu-refcount.h>
+#include <linux/module.h>
+#include <linux/mount.h>
#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -152,12 +153,138 @@ unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio request
static struct kmem_cache *kiocb_cachep;
static struct kmem_cache *kioctx_cachep;
+static struct vfsmount *aio_mnt;
+
+static const struct file_operations aio_ring_fops;
+
+static int aio_set_page_dirty(struct page *page)
+{
+ return 0;
+};
+
+static const struct address_space_operations aio_aops = {
+ .set_page_dirty = aio_set_page_dirty,
+};
+
+/*
+ * A single inode exists for each aio_inode file. The inodes are only
+ * used for mapping the event ring buffers in order to make it possible
+ * to provide migration ops to the vm.
+ */
+static struct inode *aio_inode_mkinode(struct super_block *s)
+{
+ struct inode *inode = new_inode_pseudo(s);
+
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ inode->i_ino = get_next_ino();
+ inode->i_fop = &aio_ring_fops;
+ inode->i_mapping->a_ops = &aio_aops;
+
+ /*
+ * Mark the inode dirty from the very beginning,
+ * that way it will never be moved to the dirty
+ * list because mark_inode_dirty() will think
+ * that it already _is_ on the dirty list.
+ */
+ inode->i_state = I_DIRTY;
+ inode->i_mode = S_IRUSR | S_IWUSR;
+ inode->i_uid = current_fsuid();
+ inode->i_gid = current_fsgid();
+ inode->i_flags |= S_PRIVATE;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ return inode;
+}
+
+/**
+ * aio_inode_getfile_private - creates a new file instance by hooking it up to
+ * an anonymous inode, and a dentry that describe the "class" of the file.
+ *
+ * @name: [in] name of the "class" of the new file
+ * @fops: [in] file operations for the new file
+ * @priv: [in] private data for the new file (will be file's private_data)
+ * @flags: [in] flags
+ *
+ *
+ * Similar to aio_inode_getfile, but each file holds a single inode.
+ *
+ */
+struct file *aio_inode_getfile_private(const char *name,
+ const struct file_operations *fops,
+ void *priv, int flags)
+{
+ struct qstr this;
+ struct path path;
+ struct file *file;
+ struct inode *inode;
+
+ if (fops->owner && !try_module_get(fops->owner))
+ return ERR_PTR(-ENOENT);
+
+ inode = aio_inode_mkinode(aio_mnt->mnt_sb);
+ if (IS_ERR(inode)) {
+ file = ERR_PTR(-ENOMEM);
+ goto err_module;
+ }
+
+ /*
+ * Link the inode to a directory entry by creating a unique name
+ * using the inode sequence number.
+ */
+ file = ERR_PTR(-ENOMEM);
+ this.name = name;
+ this.len = strlen(name);
+ this.hash = 0;
+ path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
+ if (!path.dentry)
+ goto err_module;
+
+ path.mnt = mntget(aio_mnt);
+
+ d_instantiate(path.dentry, inode);
+
+ file = alloc_file(&path, OPEN_FMODE(flags), fops);
+ if (IS_ERR(file))
+ goto err_dput;
+
+ file->f_mapping = inode->i_mapping;
+ file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+ file->private_data = priv;
+
+ return file;
+
+err_dput:
+ path_put(&path);
+err_module:
+ module_put(fops->owner);
+ return file;
+}
+
+static struct dentry *aio_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ static const struct dentry_operations ops = {
+ .d_dname = simple_dname,
+ };
+ return mount_pseudo(fs_type, "aio:", NULL, &ops, 0xa10a10a1);
+}
+
/* aio_setup
* Creates the slab caches used by the aio routines, panic on
* failure as this is done early during the boot sequence.
*/
static int __init aio_setup(void)
{
+ static struct file_system_type aio_fs = {
+ .name = "aio",
+ .mount = aio_mount,
+ .kill_sb = kill_anon_super,
+ };
+ aio_mnt = kern_mount(&aio_fs);
+ if (IS_ERR(aio_mnt))
+ panic("Failed to create aio fs mount.");
+
kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
@@ -198,11 +325,6 @@ static const struct file_operations aio_ring_fops = {
.mmap = aio_ring_mmap,
};
-static int aio_set_page_dirty(struct page *page)
-{
- return 0;
-}
-
#if IS_ENABLED(CONFIG_MIGRATION)
static int aio_migratepage(struct address_space *mapping, struct page *new,
struct page *old, enum migrate_mode mode)
@@ -260,7 +382,7 @@ static int aio_setup_ring(struct kioctx *ctx)
if (nr_pages < 0)
return -EINVAL;
- file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+ file = aio_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
if (IS_ERR(file)) {
ctx->aio_ring_file = NULL;
return -EAGAIN;
--
"Thought is the essence of where you are now."
On Tue, Sep 17, 2013 at 10:18:25AM -0400, Benjamin LaHaise wrote:
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
> +};
> +
> +static const struct address_space_operations aio_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> +};
> +
> +/*
> + * A single inode exists for each aio_inode file. The inodes are only
> + * used for mapping the event ring buffers in order to make it possible
> + * to provide migration ops to the vm.
> + */
> +static struct inode *aio_inode_mkinode(struct super_block *s)
> +{
> + struct inode *inode = new_inode_pseudo(s);
> +
> + if (!inode)
> + return ERR_PTR(-ENOMEM);
> +
> + inode->i_ino = get_next_ino();
> + inode->i_fop = &aio_ring_fops;
> + inode->i_mapping->a_ops = &aio_aops;
> +
> + /*
> + * Mark the inode dirty from the very beginning,
> + * that way it will never be moved to the dirty
> + * list because mark_inode_dirty() will think
> + * that it already _is_ on the dirty list.
> + */
> + inode->i_state = I_DIRTY;
> + inode->i_mode = S_IRUSR | S_IWUSR;
> + inode->i_uid = current_fsuid();
> + inode->i_gid = current_fsgid();
> + inode->i_flags |= S_PRIVATE;
> + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> + return inode;
> +}
FWIW, I would've taken that to fs/libfs.c, sans the assignment of ->i_fop.
BTW, are you sure that you want it to be opened via procfs symlink? Is that
even possible? IOW, what's that ->i_fop for?
> +struct file *aio_inode_getfile_private(const char *name,
> + const struct file_operations *fops,
> + void *priv, int flags)
Why is it not static? And why bother with fops as argument, etc.?
> + if (fops->owner && !try_module_get(fops->owner))
> + return ERR_PTR(-ENOENT);
Also pointless, AFAICS. BTW, you've just open-coded fops_get(fops), not
that it mattered in this case...
> + inode = aio_inode_mkinode(aio_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + file = ERR_PTR(-ENOMEM);
> + goto err_module;
> + }
> +
> + /*
> + * Link the inode to a directory entry by creating a unique name
> + * using the inode sequence number.
> + */
> + file = ERR_PTR(-ENOMEM);
> + this.name = name;
> + this.len = strlen(name);
> + this.hash = 0;
Umm... ITYM
struct qstr this = QSTR_INIT("[aio]", 5);
, if not
path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb,
&(struct qstr)QSTR_INIT("[aio]", 5));
> + if (!path.dentry)
> + goto err_module;
> +
> + path.mnt = mntget(aio_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + file = alloc_file(&path, OPEN_FMODE(flags), fops);
> + if (IS_ERR(file))
> + goto err_dput;
> + file->f_mapping = inode->i_mapping;
Pointless, BTW - alloc_file() will have already set it that way.
> + file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> + file->private_data = priv;
> +
> + return file;
> +
> +err_dput:
> + path_put(&path);
> +err_module:
> + module_put(fops->owner);
And that module_put is pointless as well here.
On Thu, Oct 03, 2013 at 03:22:51AM +0100, Al Viro wrote:
> > + /*
> > + * Link the inode to a directory entry by creating a unique name
> > + * using the inode sequence number.
> > + */
> > + file = ERR_PTR(-ENOMEM);
> > + this.name = name;
> > + this.len = strlen(name);
> > + this.hash = 0;
>
> Umm... ITYM
> struct qstr this = QSTR_INIT("[aio]", 5);
> , if not
> path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb,
> &(struct qstr)QSTR_INIT("[aio]", 5));
> > + if (!path.dentry)
> > + goto err_module;
and you've leaked that inode, BTW.
Another thing: I'd rather pull everything about setting the inode
up (aops, size, etc.) in there.
Anyway, could you take a look at the last couple of commits in
vfs.git#for-bcrl? Commit message on the last one is an atrocity,
of course...
Hi Al,
On Thu, Oct 03, 2013 at 03:50:07AM +0100, Al Viro wrote:
...
> Another thing: I'd rather pull everything about setting the inode
> up (aops, size, etc.) in there.
>
> Anyway, could you take a look at the last couple of commits in
> vfs.git#for-bcrl? Commit message on the last one is an atrocity,
> of course...
Sorry for not replying sooner, I'd actually reviewed this last week but got
distracted before replying.
This looks much better in terms of the code that is specific to aio. I've
run the aio tests I wrote for the numa page migration test case, and they
look okay with your changes from vfs.git@for-bcrl. Please feel free to
add my Tested/Acked-by.
Tested-by: Benjamin LaHaise <[email protected]>
Acked-by: Benjamin LaHaise <[email protected]>
-ben
--
"Thought is the essence of where you are now."