2013-03-21 16:36:07

by Kent Overstreet

Subject: [PATCH 00/33] AIO cleanups/performance improvements

This is a respin of the AIO patches that have been in Andrew's tree,
with all the various fixes squashed.

Two differences from the code that was in Andrew's tree:

* The "block: Prep work for batch completion" patch is new -
previously, the batch completion stuff added a separate
bi_batch_end_io, this now adds the struct batch_complete * argument
to bi_end_io.

* When I went to squash the "aio: fix ringbuffer calculation so we
don't wrap" patch
http://atlas.evilpiepirate.org/git/linux-bcache.git/commit/?h=aio-upstream-v0&id=790a3cec8322c4e07704e9356495acdf6ee6aff4
I realized it unintentionally changed behaviour relative to upstream,
so I redid it correctly and added some comments.
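
For reference, after the "Prep work for batch completion" patch the
bi_end_io callbacks take the batch pointer directly. A rough sketch
(the function name here is made up; the real conversion is in that
patch):

static void example_end_io(struct bio *bio, int error,
			   struct batch_complete *batch)
{
	/* per-bio cleanup; pass @batch on when completing a request */
}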

Here's the output of git diff between the two branches (excluding the
"prep work for batch completion" patch):

diff --git a/fs/aio.c b/fs/aio.c
index 33e9db3..d2c1a82 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -75,20 +75,22 @@ struct kioctx {

struct __percpu kioctx_cpu *cpu;

- /* Size of ringbuffer, in units of struct io_event */
- unsigned nr_events;
-
- /*
- * Maximum number of outstanding requests:
- * sys_io_setup currently limits this to an unsigned int
- */
- unsigned max_reqs;
-
/*
* For percpu reqs_available, number of slots we move to/from global
* counter at a time:
*/
unsigned req_batch;
+ /*
+ * This is what userspace passed to io_setup(), it's not used for
+ * anything but counting against the global max_reqs quota.
+ *
+ * The real limit is nr_events - 1, which will be larger (see
+ * aio_setup_ring())
+ */
+ unsigned max_reqs;
+
+ /* Size of ringbuffer, in units of struct io_event */
+ unsigned nr_events;

unsigned long mmap_base;
unsigned long mmap_size;
@@ -121,21 +123,20 @@ struct kioctx {
wait_queue_head_t wait;

/*
- * Copy of the real tail, that aio_complete uses - to reduce
- * cacheline bouncing. The real tail will tend to be much more
- * contended - since typically events are delivered one at a
- * time, and then aio_read_events() slurps them up a bunch at a
- * time - so it's helpful if aio_read_events() isn't also
- * contending for the tail. So, aio_complete() updates
- * shadow_tail whenever it updates tail.
- *
- * Also needed because tail is used as a hacky lock and isn't
- * always the real tail.
+ * Copy of the real tail - to reduce cacheline bouncing. Updated
+ * by aio_complete() whenever it updates the real tail.
*/
unsigned shadow_tail;
} ____cacheline_aligned_in_smp;

struct {
+ /*
+ * This is the canonical copy of the tail pointer, updated by
+ * aio_complete(). But aio_complete() also uses it as a lock, so
+ * other code can't use it; aio_complete() keeps shadow_tail in
+ * sync with the real value of the tail pointer for other code
+ * to use.
+ */
unsigned tail;
} ____cacheline_aligned_in_smp;

@@ -347,20 +348,20 @@ static void free_ioctx(struct kioctx *ctx)
head = ring->head;
kunmap_atomic(ring);

- while (atomic_read(&ctx->reqs_available) < ctx->max_reqs) {
+ while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
wait_event(ctx->wait,
(head != ctx->shadow_tail) ||
- (atomic_read(&ctx->reqs_available) >= ctx->max_reqs));
+ (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));

- avail = (head <= ctx->shadow_tail ?
- ctx->shadow_tail : ctx->nr_events) - head;
+ avail = (head <= ctx->shadow_tail
+ ? ctx->shadow_tail : ctx->nr_events) - head;

atomic_add(avail, &ctx->reqs_available);
head += avail;
head %= ctx->nr_events;
}

- WARN_ON(atomic_read(&ctx->reqs_available) > ctx->max_reqs);
+ WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

aio_free_ring(ctx);

@@ -423,8 +424,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-ENOMEM);

ctx->max_reqs = nr_events;
- atomic_set(&ctx->reqs_available, nr_events);
- ctx->req_batch = nr_events / (num_possible_cpus() * 4);

percpu_ref_init(&ctx->users);
rcu_read_lock();
@@ -444,6 +443,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (aio_setup_ring(ctx) < 0)
goto out_freepcpu;

+ atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+ ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
+ BUG_ON(!ctx->req_batch);
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||

Benjamin LaHaise (1):
aio: fix kioctx not being freed after cancellation at exit time

Kent Overstreet (27):
aio: kill return value of aio_complete()
aio: add kiocb_cancel()
aio: move private stuff out of aio.h
aio: dprintk() -> pr_debug()
aio: do fget() after aio_get_req()
aio: make aio_put_req() lockless
aio: refcounting cleanup
wait: add wait_event_hrtimeout()
aio: make aio_read_evt() more efficient, convert to hrtimers
aio: use flush_dcache_page()
aio: use cancellation list lazily
aio: change reqs_active to include unreaped completions
aio: kill batch allocation
aio: kill struct aio_ring_info
aio: give shared kioctx fields their own cachelines
aio: reqs_active -> reqs_available
aio: percpu reqs_available
generic dynamic per cpu refcounting
aio: percpu ioctx refcount
aio: use xchg() instead of completion_lock
aio: don't include aio.h in sched.h
aio: kill ki_key
aio: kill ki_retry
block: Prep work for batch completion
block, aio: batch completion for bios/kiocbs
virtio-blk: convert to batch completion
mtip32xx: convert to batch completion

Zach Brown (5):
mm: remove old aio use_mm() comment
aio: remove dead code from aio.h
gadget: remove only user of aio retry
aio: remove retry-based AIO
char: add aio_{read,write} to /dev/{null,zero}

arch/s390/hypfs/inode.c | 1 +
block/blk-core.c | 34 +-
block/blk-flush.c | 5 +-
block/blk-lib.c | 3 +-
block/blk.h | 3 +-
block/scsi_ioctl.c | 1 +
drivers/block/drbd/drbd_bitmap.c | 2 +-
drivers/block/drbd/drbd_worker.c | 6 +-
drivers/block/drbd/drbd_wrappers.h | 9 +-
drivers/block/floppy.c | 3 +-
drivers/block/mtip32xx/mtip32xx.c | 86 +-
drivers/block/mtip32xx/mtip32xx.h | 8 +-
drivers/block/pktcdvd.c | 9 +-
drivers/block/swim3.c | 2 +-
drivers/block/virtio_blk.c | 31 +-
drivers/block/xen-blkback/blkback.c | 3 +-
drivers/char/mem.c | 36 +
drivers/infiniband/hw/ipath/ipath_file_ops.c | 1 +
drivers/infiniband/hw/qib/qib_file_ops.c | 2 +-
drivers/md/dm-bufio.c | 9 +-
drivers/md/dm-crypt.c | 3 +-
drivers/md/dm-io.c | 2 +-
drivers/md/dm-snap.c | 3 +-
drivers/md/dm-thin.c | 3 +-
drivers/md/dm-verity.c | 3 +-
drivers/md/dm.c | 8 +-
drivers/md/faulty.c | 3 +-
drivers/md/md.c | 9 +-
drivers/md/multipath.c | 3 +-
drivers/md/raid1.c | 15 +-
drivers/md/raid10.c | 21 +-
drivers/md/raid5.c | 15 +-
drivers/scsi/sg.c | 1 +
drivers/staging/android/logger.c | 1 +
drivers/target/target_core_iblock.c | 6 +-
drivers/target/target_core_pscsi.c | 3 +-
drivers/usb/gadget/inode.c | 42 +-
fs/9p/vfs_addr.c | 1 +
fs/afs/write.c | 1 +
fs/aio.c | 1811 +++++++++++---------------
fs/bio-integrity.c | 3 +-
fs/bio.c | 62 +-
fs/block_dev.c | 1 +
fs/btrfs/check-integrity.c | 14 +-
fs/btrfs/compression.c | 6 +-
fs/btrfs/disk-io.c | 6 +-
fs/btrfs/extent_io.c | 12 +-
fs/btrfs/file.c | 1 +
fs/btrfs/inode.c | 14 +-
fs/btrfs/scrub.c | 18 +-
fs/btrfs/volumes.c | 4 +-
fs/buffer.c | 3 +-
fs/ceph/file.c | 1 +
fs/compat.c | 1 +
fs/direct-io.c | 21 +-
fs/ecryptfs/file.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext3/inode.c | 1 +
fs/ext4/file.c | 1 +
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 1 +
fs/ext4/page-io.c | 4 +-
fs/f2fs/data.c | 3 +-
fs/f2fs/segment.c | 3 +-
fs/fat/inode.c | 1 +
fs/fuse/cuse.c | 1 +
fs/fuse/dev.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/aops.c | 1 +
fs/gfs2/file.c | 1 +
fs/gfs2/lops.c | 3 +-
fs/gfs2/ops_fstype.c | 3 +-
fs/hfs/inode.c | 1 +
fs/hfsplus/inode.c | 1 +
fs/hfsplus/wrapper.c | 3 +-
fs/jfs/inode.c | 1 +
fs/jfs/jfs_logmgr.c | 4 +-
fs/jfs/jfs_metapage.c | 6 +-
fs/logfs/dev_bdev.c | 8 +-
fs/mpage.c | 2 +-
fs/nfs/blocklayout/blocklayout.c | 17 +-
fs/nilfs2/inode.c | 2 +-
fs/nilfs2/segbuf.c | 3 +-
fs/ntfs/file.c | 1 +
fs/ntfs/inode.c | 1 +
fs/ocfs2/aops.h | 2 +
fs/ocfs2/cluster/heartbeat.c | 4 +-
fs/ocfs2/dlmglue.c | 2 +-
fs/ocfs2/inode.h | 2 +
fs/pipe.c | 1 +
fs/read_write.c | 35 +-
fs/reiserfs/inode.c | 1 +
fs/ubifs/file.c | 1 +
fs/udf/inode.c | 1 +
fs/xfs/xfs_aops.c | 4 +-
fs/xfs/xfs_buf.c | 3 +-
fs/xfs/xfs_file.c | 1 +
include/linux/aio.h | 199 +--
include/linux/batch_complete.h | 23 +
include/linux/bio.h | 38 +-
include/linux/blk_types.h | 4 +-
include/linux/blkdev.h | 12 +-
include/linux/cgroup.h | 1 +
include/linux/errno.h | 1 -
include/linux/fs.h | 2 +-
include/linux/percpu-refcount.h | 114 ++
include/linux/pid_namespace.h | 1 +
include/linux/sched.h | 2 -
include/linux/swap.h | 3 +-
include/linux/wait.h | 86 ++
include/linux/writeback.h | 1 +
kernel/fork.c | 1 +
kernel/printk.c | 1 +
kernel/ptrace.c | 1 +
lib/Makefile | 2 +-
lib/percpu-refcount.c | 243 ++++
mm/bounce.c | 12 +-
mm/mmu_context.c | 3 -
mm/page_io.c | 6 +-
mm/shmem.c | 1 +
mm/swap.c | 1 +
security/keys/internal.h | 2 +
security/keys/keyctl.c | 1 +
sound/core/pcm_native.c | 2 +-
124 files changed, 1785 insertions(+), 1488 deletions(-)
create mode 100644 include/linux/batch_complete.h
create mode 100644 include/linux/percpu-refcount.h
create mode 100644 lib/percpu-refcount.c

--
1.8.1.3


2013-03-21 16:36:11

by Kent Overstreet

Subject: [PATCH 03/33] gadget: remove only user of aio retry

From: Zach Brown <[email protected]>

This removes the only in-tree user of aio retry. This will let us remove
the retry code from the aio core.

Removing retry is relatively easy as the USB gadget wasn't using it to
retry IOs at all. It always fully submitted the IO in the context of the
initial io_submit() call. It only used the AIO retry facility to get the
submitter's mm context for copying the result of a read back to user
space. This is easy to implement with use_mm() and a work struct, much
like kvm does with async_pf_execute() for get_user_pages().
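
The shape of it, as an illustrative sketch (the struct and function
names here are made up; the real code is in the diff below):

static void copy_worker(struct work_struct *work)
{
	struct my_priv *priv = container_of(work, struct my_priv, work);

	use_mm(priv->mm);	/* borrow the submitter's mm */
	/* copy the read result back to the submitter's buffers here */
	unuse_mm(priv->mm);

	aio_complete(priv->iocb, priv->ret, priv->ret);
}

At submission time the driver stashes current->mm and INIT_WORK()s the
worker; the USB completion callback then just does
schedule_work(&priv->work) instead of kick_iocb().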

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/usb/gadget/inode.c | 38 +++++++++++++++++++++++++++++---------
1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index e2b2e9c..a1aad43 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -24,6 +24,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/poll.h>
+#include <linux/mmu_context.h>

#include <linux/device.h>
#include <linux/moduleparam.h>
@@ -513,6 +514,9 @@ static long ep_ioctl(struct file *fd, unsigned code, unsigned long value)
struct kiocb_priv {
struct usb_request *req;
struct ep_data *epdata;
+ struct kiocb *iocb;
+ struct mm_struct *mm;
+ struct work_struct work;
void *buf;
const struct iovec *iv;
unsigned long nr_segs;
@@ -540,15 +544,12 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
return value;
}

-static ssize_t ep_aio_read_retry(struct kiocb *iocb)
+static ssize_t ep_copy_to_user(struct kiocb_priv *priv)
{
- struct kiocb_priv *priv = iocb->private;
ssize_t len, total;
void *to_copy;
int i;

- /* we "retry" to get the right mm context for this: */
-
/* copy stuff into user buffers */
total = priv->actual;
len = 0;
@@ -568,9 +569,26 @@ static ssize_t ep_aio_read_retry(struct kiocb *iocb)
if (total == 0)
break;
}
+
+ return len;
+}
+
+static void ep_user_copy_worker(struct work_struct *work)
+{
+ struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
+ struct mm_struct *mm = priv->mm;
+ struct kiocb *iocb = priv->iocb;
+ size_t ret;
+
+ use_mm(mm);
+ ret = ep_copy_to_user(priv);
+ unuse_mm(mm);
+
+ /* completing the iocb can drop the ctx and mm, don't touch mm after */
+ aio_complete(iocb, ret, ret);
+
kfree(priv->buf);
kfree(priv);
- return len;
}

static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -596,14 +614,14 @@ static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
aio_complete(iocb, req->actual ? req->actual : req->status,
req->status);
} else {
- /* retry() won't report both; so we hide some faults */
+ /* ep_copy_to_user() won't report both; we hide some faults */
if (unlikely(0 != req->status))
DBG(epdata->dev, "%s fault %d len %d\n",
ep->name, req->status, req->actual);

priv->buf = req->buf;
priv->actual = req->actual;
- kick_iocb(iocb);
+ schedule_work(&priv->work);
}
spin_unlock(&epdata->dev->lock);

@@ -633,8 +651,10 @@ fail:
return value;
}
iocb->private = priv;
+ priv->iocb = iocb;
priv->iv = iv;
priv->nr_segs = nr_segs;
+ INIT_WORK(&priv->work, ep_user_copy_worker);

value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
if (unlikely(value < 0)) {
@@ -646,6 +666,7 @@ fail:
get_ep(epdata);
priv->epdata = epdata;
priv->actual = 0;
+ priv->mm = current->mm; /* mm teardown waits for iocbs in exit_aio() */

/* each kiocb is coupled to one usb_request, but we can't
* allocate or submit those if the host disconnected.
@@ -674,7 +695,7 @@ fail:
kfree(priv);
put_ep(epdata);
} else
- value = (iv ? -EIOCBRETRY : -EIOCBQUEUED);
+ value = -EIOCBQUEUED;
return value;
}

@@ -692,7 +713,6 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!buf))
return -ENOMEM;

- iocb->ki_retry = ep_aio_read_retry;
return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
}

--
1.8.1.3

2013-03-21 16:36:24

by Kent Overstreet

Subject: [PATCH 10/33] aio: do fget() after aio_get_req()

aio_get_req() will fail if we have the maximum number of requests
outstanding, which, depending on the application, may not be uncommon.
So avoid an unnecessary fget() by taking the file reference only after
aio_get_req() has succeeded.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2637555..4f23d43 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -587,6 +587,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
{
assert_spin_locked(&ctx->ctx_lock);

+ if (req->ki_filp)
+ fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
@@ -605,9 +607,6 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
*/
static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
- pr_debug("(%p): f_count=%ld\n",
- req, atomic_long_read(&req->ki_filp->f_count));
-
assert_spin_locked(&ctx->ctx_lock);

req->ki_users--;
@@ -618,8 +617,6 @@ static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
req->ki_cancel = NULL;
req->ki_retry = NULL;

- fput(req->ki_filp);
- req->ki_filp = NULL;
really_put_req(ctx, req);
}

@@ -1266,7 +1263,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
bool compat)
{
struct kiocb *req;
- struct file *file;
ssize_t ret;

/* enforce forwards compatibility on users */
@@ -1285,16 +1281,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return -EINVAL;
}

- file = fget(iocb->aio_fildes);
- if (unlikely(!file))
- return -EBADF;
-
req = aio_get_req(ctx, batch); /* returns with 2 references to req */
- if (unlikely(!req)) {
- fput(file);
+ if (unlikely(!req))
return -EAGAIN;
+
+ req->ki_filp = fget(iocb->aio_fildes);
+ if (unlikely(!req->ki_filp)) {
+ ret = -EBADF;
+ goto out_put_req;
}
- req->ki_filp = file;
+
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
/*
* If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
--
1.8.1.3

2013-03-21 16:36:38

by Kent Overstreet

Subject: [PATCH 18/33] aio: kill batch allocation

Previously, allocating a kiocb required touching quite a few global (well,
per-kioctx) cachelines... so batching up allocations to amortize that cost
was worthwhile. But we've gotten rid of some of those cachelines, and in
another couple of patches kiocb allocation won't require writing to any
shared cachelines at all, so we can just rip this code out.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 116 +++++++---------------------------------------------
include/linux/aio.h | 1 -
2 files changed, 15 insertions(+), 102 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6828a31..95fcd08 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -510,108 +510,27 @@ void exit_aio(struct mm_struct *mm)
* This prevents races between the aio code path referencing the
* req (after submitting it) and aio_complete() freeing the req.
*/
-static struct kiocb *__aio_get_req(struct kioctx *ctx)
+static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
- struct kiocb *req = NULL;
+ struct kiocb *req;
+
+ if (atomic_read(&ctx->reqs_active) >= ctx->ring_info.nr)
+ return NULL;
+
+ if (atomic_inc_return(&ctx->reqs_active) > ctx->ring_info.nr - 1)
+ goto out_put;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
- return NULL;
+ goto out_put;

atomic_set(&req->ki_users, 2);
req->ki_ctx = ctx;

return req;
-}
-
-/*
- * struct kiocb's are allocated in batches to reduce the number of
- * times the ctx lock is acquired and released.
- */
-#define KIOCB_BATCH_SIZE 32L
-struct kiocb_batch {
- struct list_head head;
- long count; /* number of requests left to allocate */
-};
-
-static void kiocb_batch_init(struct kiocb_batch *batch, long total)
-{
- INIT_LIST_HEAD(&batch->head);
- batch->count = total;
-}
-
-static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
-{
- struct kiocb *req, *n;
-
- if (list_empty(&batch->head))
- return;
-
- spin_lock_irq(&ctx->ctx_lock);
- list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
- list_del(&req->ki_batch);
- kmem_cache_free(kiocb_cachep, req);
- atomic_dec(&ctx->reqs_active);
- }
- spin_unlock_irq(&ctx->ctx_lock);
-}
-
-/*
- * Allocate a batch of kiocbs. This avoids taking and dropping the
- * context lock a lot during setup.
- */
-static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
-{
- unsigned short allocated, to_alloc;
- long avail;
- struct kiocb *req, *n;
-
- to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
- for (allocated = 0; allocated < to_alloc; allocated++) {
- req = __aio_get_req(ctx);
- if (!req)
- /* allocation failed, go with what we've got */
- break;
- list_add(&req->ki_batch, &batch->head);
- }
-
- if (allocated == 0)
- goto out;
-
- spin_lock_irq(&ctx->ctx_lock);
-
- avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active) - 1;
- BUG_ON(avail < 0);
- if (avail < allocated) {
- /* Trim back the number of requests. */
- list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
- list_del(&req->ki_batch);
- kmem_cache_free(kiocb_cachep, req);
- if (--allocated <= avail)
- break;
- }
- }
-
- batch->count -= allocated;
- atomic_add(allocated, &ctx->reqs_active);
-
- spin_unlock_irq(&ctx->ctx_lock);
-
-out:
- return allocated;
-}
-
-static inline struct kiocb *aio_get_req(struct kioctx *ctx,
- struct kiocb_batch *batch)
-{
- struct kiocb *req;
-
- if (list_empty(&batch->head))
- if (kiocb_batch_refill(ctx, batch) == 0)
- return NULL;
- req = list_first_entry(&batch->head, struct kiocb, ki_batch);
- list_del(&req->ki_batch);
- return req;
+out_put:
+ atomic_dec(&ctx->reqs_active);
+ return NULL;
}

static void kiocb_free(struct kiocb *req)
@@ -1192,8 +1111,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
}

static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
- struct iocb *iocb, struct kiocb_batch *batch,
- bool compat)
+ struct iocb *iocb, bool compat)
{
struct kiocb *req;
ssize_t ret;
@@ -1214,7 +1132,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return -EINVAL;
}

- req = aio_get_req(ctx, batch); /* returns with 2 references to req */
+ req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req))
return -EAGAIN;

@@ -1286,7 +1204,6 @@ long do_io_submit(aio_context_t ctx_id, long nr,
long ret = 0;
int i = 0;
struct blk_plug plug;
- struct kiocb_batch batch;

if (unlikely(nr < 0))
return -EINVAL;
@@ -1303,8 +1220,6 @@ long do_io_submit(aio_context_t ctx_id, long nr,
return -EINVAL;
}

- kiocb_batch_init(&batch, nr);
-
blk_start_plug(&plug);

/*
@@ -1325,13 +1240,12 @@ long do_io_submit(aio_context_t ctx_id, long nr,
break;
}

- ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat);
+ ret = io_submit_one(ctx, user_iocb, &tmp, compat);
if (ret)
break;
}
blk_finish_plug(&plug);

- kiocb_batch_free(ctx, &batch);
put_ioctx(ctx);
return i ? i : ret;
}
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d2a0003..f0a8481 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -85,7 +85,6 @@ struct kiocb {

struct list_head ki_list; /* the aio core uses this
* for cancellation */
- struct list_head ki_batch; /* batch allocation */

/*
* If the aio_resfd field of the userspace iocb is not zero,
--
1.8.1.3

2013-03-21 16:36:44

by Kent Overstreet

Subject: [PATCH 25/33] aio: use xchg() instead of completion_lock

Sticking kiocb completions on the kioctx ringbuffer needs a lock -
unfortunately, it can't be done locklessly.

When the kioctx is shared between threads on different cpus and the rate
of completions is high, this lock sees quite a bit of contention - in
terms of cacheline contention it's the hottest thing in the aio subsystem.

That means that with a regular spinlock, we take one cache miss to grab
the lock and then another when we touch the data the lock protects - and
if that data is on the same cacheline as the lock, other cpus spinning on
the lock keep pulling the line out from under us while we're using it.

So, we use an old trick to get rid of this second forced cache miss - make
the data the lock protects be the lock itself, so we grab them both at
once.
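
Condensed from the diff below (the smp_wmb() and the ring-page update
are elided), the tail pointer itself is the lock - UINT_MAX means
"locked", and storing the real tail back is the unlock:

	local_irq_save(flags);
	while ((tail = xchg(&ctx->tail, UINT_MAX)) == UINT_MAX)
		cpu_relax();

	/* ... write the completion event at tail, advance tail ... */

	ctx->shadow_tail = tail;	/* the copy other code reads */
	smp_mb();
	ctx->tail = tail;		/* storing the real tail = unlock */
	local_irq_restore(flags);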

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 59 ++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 3db2dab..e4b1cc1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -120,11 +120,23 @@ struct kioctx {
struct {
struct mutex ring_lock;
wait_queue_head_t wait;
+
+ /*
+ * Copy of the real tail - to reduce cacheline bouncing. Updated
+ * by aio_complete() whenever it updates the real tail.
+ */
+ unsigned shadow_tail;
} ____cacheline_aligned_in_smp;

struct {
+ /*
+ * This is the canonical copy of the tail pointer, updated by
+ * aio_complete(). But aio_complete() also uses it as a lock, so
+ * other code can't use it; aio_complete() keeps shadow_tail in
+ * sync with the real value of the tail pointer for other code
+ * to use.
+ */
unsigned tail;
- spinlock_t completion_lock;
} ____cacheline_aligned_in_smp;

struct page *internal_pages[AIO_RING_PAGES];
@@ -336,9 +348,10 @@ static void free_ioctx(struct kioctx *ctx)
kunmap_atomic(ring);

while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
- wait_event(ctx->wait, head != ctx->tail);
+ wait_event(ctx->wait, head != ctx->shadow_tail);

- avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;
+ avail = (head <= ctx->shadow_tail
+ ? ctx->shadow_tail : ctx->nr_events) - head;

atomic_add(avail, &ctx->reqs_available);
head += avail;
@@ -415,7 +428,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
rcu_read_unlock();

spin_lock_init(&ctx->ctx_lock);
- spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
init_waitqueue_head(&ctx->wait);

@@ -713,18 +725,19 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
* free_ioctx()
*/
atomic_inc(&ctx->reqs_available);
+ smp_mb__after_atomic_inc();
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}

/*
- * Add a completion event to the ring buffer. Must be done holding
- * ctx->ctx_lock to prevent other code from messing with the tail
- * pointer since we might be called from irq context.
+ * Add a completion event to the ring buffer; ctx->tail is both our lock
+ * and the canonical version of the tail pointer.
*/
- spin_lock_irqsave(&ctx->completion_lock, flags);
+ local_irq_save(flags);
+ while ((tail = xchg(&ctx->tail, UINT_MAX)) == UINT_MAX)
+ cpu_relax();

- tail = ctx->tail;
pos = tail + AIO_EVENTS_OFFSET;

if (++tail >= ctx->nr_events)
@@ -750,14 +763,18 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
smp_wmb(); /* make event visible before updating tail */

- ctx->tail = tail;
+ ctx->shadow_tail = tail;

ring = kmap_atomic(ctx->ring_pages[0]);
ring->tail = tail;
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- spin_unlock_irqrestore(&ctx->completion_lock, flags);
+ /* unlock, make new tail visible before checking waitlist */
+ smp_mb();
+
+ ctx->tail = tail;
+ local_irq_restore(flags);

pr_debug("added to ring %p at [%u]\n", iocb, tail);

@@ -773,14 +790,6 @@ put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);

- /*
- * We have to order our ring_info tail store above and test
- * of the wait list below outside the wait lock. This is
- * like in wake_up_bit() where clearing a bit has to be
- * ordered with the unlocked test.
- */
- smp_mb();
-
if (waitqueue_active(&ctx->wait))
wake_up(&ctx->wait);

@@ -806,18 +815,18 @@ static long aio_read_events_ring(struct kioctx *ctx,
head = ring->head;
kunmap_atomic(ring);

- pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+ pr_debug("h%u t%u m%u\n", head, ctx->shadow_tail, ctx->nr_events);

- if (head == ctx->tail)
+ if (head == ctx->shadow_tail)
goto out;

while (ret < nr) {
- long avail = (head <= ctx->tail
- ? ctx->tail : ctx->nr_events) - head;
+ long avail = (head <= ctx->shadow_tail
+ ? ctx->shadow_tail : ctx->nr_events) - head;
struct io_event *ev;
struct page *page;

- if (head == ctx->tail)
+ if (head == ctx->shadow_tail)
break;

avail = min(avail, nr - ret);
@@ -847,7 +856,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- pr_debug("%li h%u t%u\n", ret, head, ctx->tail);
+ pr_debug("%li h%u t%u\n", ret, head, ctx->shadow_tail);

put_reqs_available(ctx, ret);
out:
--
1.8.1.3

2013-03-21 16:36:50

by Kent Overstreet

Subject: [PATCH 30/33] block, aio: batch completion for bios/kiocbs

When completing a kiocb, there's some fixed overhead from touching the
kioctx's ring buffer the kiocb belongs to. Some newer high end block
devices can complete multiple IOs per interrupt, much as many network
interfaces have been doing for some time.

This plumbs through infrastructure so we can take advantage of multiple
completions at the interrupt level, and complete multiple kiocbs at the
same time.

Drivers have to be converted to take advantage of this, but it's a simple
change and the next patches will convert a few drivers.

To use it, an interrupt handler (or any code that completes bios or
requests) declares and initializes a struct batch_complete:

struct batch_complete batch;
batch_complete_init(&batch);

Then, instead of calling bio_endio(), it calls
bio_endio_batch(bio, err, &batch). This just adds the bio to a list in
the batch_complete.

At the end, it calls

batch_complete(&batch);

This completes all the bios at once, building up a list of kiocbs; the
kiocbs are then completed all at once as well.
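
Condensed, the driver-side pattern looks like this (see the virtio-blk
conversion later in this series for a real example):

	struct batch_complete batch;

	batch_complete_init(&batch);

	/* for each bio or request the irq handler finds completed: */
	bio_endio_batch(bio, error, &batch);
	/* or, on the request path: */
	blk_end_request_all_batch(req, error, &batch);

	batch_complete(&batch);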

[[email protected]: fix warning]
[[email protected]: fs/aio.c needs bio.h, move bio_endio_batch() declaration somewhere rational]
[[email protected]: fix warnings]
[[email protected]: fix build error due to bio_endio_batch]
[[email protected]: fix tracepoint in batch_complete()]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
block/blk-core.c | 34 +++---
block/blk-flush.c | 2 +-
block/blk.h | 3 +-
drivers/block/swim3.c | 2 +-
drivers/md/dm.c | 2 +-
fs/aio.c | 257 ++++++++++++++++++++++++++++-------------
fs/bio.c | 50 ++++----
fs/direct-io.c | 11 +-
include/linux/aio.h | 25 +++-
include/linux/batch_complete.h | 23 ++++
include/linux/bio.h | 36 +++++-
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 12 +-
13 files changed, 321 insertions(+), 137 deletions(-)
create mode 100644 include/linux/batch_complete.h

diff --git a/block/blk-core.c b/block/blk-core.c
index 074b758..186603b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,7 +151,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
EXPORT_SYMBOL(blk_rq_init);

static void req_bio_endio(struct request *rq, struct bio *bio,
- unsigned int nbytes, int error)
+ unsigned int nbytes, int error,
+ struct batch_complete *batch)
{
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -175,7 +176,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,

/* don't actually finish bio if it's part of flush sequence */
if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
- bio_endio(bio, error);
+ bio_endio_batch(bio, error, batch);
}

void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -2250,7 +2251,8 @@ EXPORT_SYMBOL(blk_fetch_request);
* %false - this request doesn't have any more data
* %true - this request has more data
**/
-bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
+bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
+ struct batch_complete *batch)
{
int total_bytes, bio_nbytes, next_idx = 0;
struct bio *bio;
@@ -2306,7 +2308,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
if (nr_bytes >= bio->bi_size) {
req->bio = bio->bi_next;
nbytes = bio->bi_size;
- req_bio_endio(req, bio, nbytes, error);
+ req_bio_endio(req, bio, nbytes, error, batch);
next_idx = 0;
bio_nbytes = 0;
} else {
@@ -2368,7 +2370,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
* if the request wasn't completed, update state
*/
if (bio_nbytes) {
- req_bio_endio(req, bio, bio_nbytes, error);
+ req_bio_endio(req, bio, bio_nbytes, error, batch);
bio->bi_idx += next_idx;
bio_iovec(bio)->bv_offset += nr_bytes;
bio_iovec(bio)->bv_len -= nr_bytes;
@@ -2405,14 +2407,15 @@ EXPORT_SYMBOL_GPL(blk_update_request);

static bool blk_update_bidi_request(struct request *rq, int error,
unsigned int nr_bytes,
- unsigned int bidi_bytes)
+ unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_request(rq, error, nr_bytes))
+ if (blk_update_request(rq, error, nr_bytes, batch))
return true;

/* Bidi request must be completed as a whole */
if (unlikely(blk_bidi_rq(rq)) &&
- blk_update_request(rq->next_rq, error, bidi_bytes))
+ blk_update_request(rq->next_rq, error, bidi_bytes, batch))
return true;

if (blk_queue_add_random(rq->q))
@@ -2495,7 +2498,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
struct request_queue *q = rq->q;
unsigned long flags;

- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, NULL))
return true;

spin_lock_irqsave(q->queue_lock, flags);
@@ -2521,9 +2524,10 @@ static bool blk_end_bidi_request(struct request *rq, int error,
* %true - still buffers pending for this request
**/
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes)
+ unsigned int nr_bytes, unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, batch))
return true;

blk_finish_request(rq, error);
@@ -2624,7 +2628,7 @@ EXPORT_SYMBOL_GPL(blk_end_request_err);
**/
bool __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
{
- return __blk_end_bidi_request(rq, error, nr_bytes, 0);
+ return __blk_end_bidi_request(rq, error, nr_bytes, 0, NULL);
}
EXPORT_SYMBOL(__blk_end_request);

@@ -2636,7 +2640,7 @@ EXPORT_SYMBOL(__blk_end_request);
* Description:
* Completely finish @rq. Must be called with queue lock held.
*/
-void __blk_end_request_all(struct request *rq, int error)
+void blk_end_request_all_batch(struct request *rq, int error, struct batch_complete *batch)
{
bool pending;
unsigned int bidi_bytes = 0;
@@ -2644,10 +2648,10 @@ void __blk_end_request_all(struct request *rq, int error)
if (unlikely(blk_bidi_rq(rq)))
bidi_bytes = blk_rq_bytes(rq->next_rq);

- pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
+ pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes, batch);
BUG_ON(pending);
}
-EXPORT_SYMBOL(__blk_end_request_all);
+EXPORT_SYMBOL(blk_end_request_all_batch);

/**
* __blk_end_request_cur - Helper function to finish the current request chunk.
diff --git a/block/blk-flush.c b/block/blk-flush.c
index d994710..8f6ddeb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -316,7 +316,7 @@ void blk_insert_flush(struct request *rq)
* complete the request.
*/
if (!policy) {
- __blk_end_bidi_request(rq, 0, 0, 0);
+ __blk_end_bidi_request(rq, 0, 0, 0, NULL);
return;
}

diff --git a/block/blk.h b/block/blk.h
index e837b8f..dc8fee6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -31,7 +31,8 @@ void blk_queue_bypass_end(struct request_queue *q);
void blk_dequeue_request(struct request *rq);
void __blk_queue_free_tags(struct request_queue *q);
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes);
+ unsigned int nr_bytes, unsigned int bidi_bytes,
+ struct batch_complete *batch);

void blk_rq_timed_out_timer(unsigned long data);
void blk_delete_timer(struct request *);
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 758f2ac..deb722d 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -775,7 +775,7 @@ static irqreturn_t swim3_interrupt(int irq, void *dev_id)
if (intr & ERROR_INTR) {
n = fs->scount - 1 - resid / 512;
if (n > 0) {
- blk_update_request(req, 0, n << 9);
+ blk_update_request(req, 0, n << 9, NULL);
fs->req_sector += n;
}
if (fs->retries < 5) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a1e371a..142f271 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -697,7 +697,7 @@ static void end_clone_bio(struct bio *clone, int error,
* Do not use blk_end_request() here, because it may complete
* the original request before the clone, and break the ordering.
*/
- blk_update_request(tio->orig, 0, nr_bytes);
+ blk_update_request(tio->orig, 0, nr_bytes, NULL);
}

/*
diff --git a/fs/aio.c b/fs/aio.c
index ba23c03..4dbd240 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -25,6 +25,7 @@
#include <linux/file.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/bio.h>
#include <linux/mmu_context.h>
#include <linux/percpu.h>
#include <linux/slab.h>
@@ -674,71 +675,11 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}

-/* aio_complete
- * Called when the io request on the given iocb is complete.
- */
-void aio_complete(struct kiocb *iocb, long res, long res2)
+static inline unsigned kioctx_ring_put(struct kioctx *ctx, struct kiocb *req,
+ unsigned tail)
{
- struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring *ring;
struct io_event *ev_page, *event;
- unsigned long flags;
- unsigned tail, pos;
-
- /*
- * Special case handling for sync iocbs:
- * - events go directly into the iocb for fast handling
- * - the sync task with the iocb in its stack holds the single iocb
- * ref, no other paths have a way to get another ref
- * - the sync task helpfully left a reference to itself in the iocb
- */
- if (is_sync_kiocb(iocb)) {
- BUG_ON(atomic_read(&iocb->ki_users) != 1);
- iocb->ki_user_data = res;
- atomic_set(&iocb->ki_users, 0);
- wake_up_process(iocb->ki_obj.tsk);
- return;
- }
-
- /*
- * Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after incrementing reqs_available.
- */
- rcu_read_lock();
-
- if (iocb->ki_list.next) {
- unsigned long flags;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_del(&iocb->ki_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- }
-
- /*
- * cancelled requests don't get events, userland was given one
- * when the event got cancelled.
- */
- if (unlikely(xchg(&iocb->ki_cancel,
- KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- /*
- * Can't use the percpu reqs_available here - could race with
- * free_ioctx()
- */
- atomic_inc(&ctx->reqs_available);
- smp_mb__after_atomic_inc();
- /* Still need the wake_up in case free_ioctx is waiting */
- goto put_rq;
- }
-
- /*
- * Add a completion event to the ring buffer; ctx->tail is both our lock
- * and the canonical version of the tail pointer.
- */
- local_irq_save(flags);
- while ((tail = xchg(&ctx->tail, UINT_MAX)) == UINT_MAX)
- cpu_relax();
-
- pos = tail + AIO_EVENTS_OFFSET;
+ unsigned pos = tail + AIO_EVENTS_OFFSET;

if (++tail >= ctx->nr_events)
tail = 0;
@@ -746,22 +687,44 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

- event->obj = (u64)(unsigned long)iocb->ki_obj.user;
- event->data = iocb->ki_user_data;
- event->res = res;
- event->res2 = res2;
+ event->obj = (u64)(unsigned long)req->ki_obj.user;
+ event->data = req->ki_user_data;
+ event->res = req->ki_res;
+ event->res2 = req->ki_res2;

kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
- ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
- res, res2);
+ ctx, tail, req, req->ki_obj.user, req->ki_user_data,
+ req->ki_res, req->ki_res2);
+
+ return tail;
+}

- /* after flagging the request as done, we
- * must never even look at it again
+static inline unsigned kioctx_ring_lock(struct kioctx *ctx)
+{
+ unsigned tail;
+
+ /*
+ * ctx->tail is both our lock and the canonical version of the tail
+ * pointer.
*/
- smp_wmb(); /* make event visible before updating tail */
+ while ((tail = xchg(&ctx->tail, UINT_MAX)) == UINT_MAX)
+ cpu_relax();
+
+ return tail;
+}
+
+static inline void kioctx_ring_unlock(struct kioctx *ctx, unsigned tail)
+{
+ struct aio_ring *ring;
+
+ if (!ctx)
+ return;
+
+ smp_wmb();
+ /* make event visible before updating tail */

ctx->shadow_tail = tail;

@@ -774,28 +737,156 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
smp_mb();

ctx->tail = tail;
- local_irq_restore(flags);

- pr_debug("added to ring %p at [%u]\n", iocb, tail);
+ if (waitqueue_active(&ctx->wait))
+ wake_up(&ctx->wait);
+}
+
+void batch_complete_aio(struct batch_complete *batch)
+{
+ struct kioctx *ctx = NULL;
+ struct eventfd_ctx *eventfd = NULL;
+ struct rb_node *n;
+ unsigned long flags;
+ unsigned tail = 0;
+
+ if (RB_EMPTY_ROOT(&batch->kiocb))
+ return;
+
+ /*
+ * Take rcu_read_lock() in case the kioctx is being destroyed, as we
+ * need to issue a wakeup after incrementing reqs_available.
+ */
+ rcu_read_lock();
+ local_irq_save(flags);
+
+ n = rb_first(&batch->kiocb);
+ while (n) {
+ struct kiocb *req = container_of(n, struct kiocb, ki_node);
+
+ if (n->rb_right) {
+ n->rb_right->__rb_parent_color = n->__rb_parent_color;
+ n = n->rb_right;
+
+ while (n->rb_left)
+ n = n->rb_left;
+ } else {
+ n = rb_parent(n);
+ }
+
+ if (unlikely(xchg(&req->ki_cancel,
+ KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+ /*
+ * Can't use the percpu reqs_available here - could race
+ * with free_ioctx()
+ */
+ atomic_inc(&req->ki_ctx->reqs_available);
+ aio_put_req(req);
+ continue;
+ }
+
+ if (unlikely(req->ki_eventfd != eventfd)) {
+ if (eventfd) {
+ /* Make event visible */
+ kioctx_ring_unlock(ctx, tail);
+ ctx = NULL;
+
+ eventfd_signal(eventfd, 1);
+ eventfd_ctx_put(eventfd);
+ }
+
+ eventfd = req->ki_eventfd;
+ req->ki_eventfd = NULL;
+ }
+
+ if (unlikely(req->ki_ctx != ctx)) {
+ kioctx_ring_unlock(ctx, tail);
+
+ ctx = req->ki_ctx;
+ tail = kioctx_ring_lock(ctx);
+ }
+
+ tail = kioctx_ring_put(ctx, req, tail);
+ aio_put_req(req);
+ }
+
+ kioctx_ring_unlock(ctx, tail);
+ local_irq_restore(flags);
+ rcu_read_unlock();

/*
* Check if the user asked us to deliver the result through an
* eventfd. The eventfd_signal() function is safe to be called
* from IRQ context.
*/
- if (iocb->ki_eventfd != NULL)
- eventfd_signal(iocb->ki_eventfd, 1);
+ if (eventfd) {
+ eventfd_signal(eventfd, 1);
+ eventfd_ctx_put(eventfd);
+ }
+}
+EXPORT_SYMBOL(batch_complete_aio);

-put_rq:
- /* everything turned out well, dispose of the aiocb. */
- aio_put_req(iocb);
+/* aio_complete_batch
+ * Called when the io request on the given iocb is complete; @batch may be
+ * NULL.
+ */
+void aio_complete_batch(struct kiocb *req, long res, long res2,
+ struct batch_complete *batch)
+{
+ req->ki_res = res;
+ req->ki_res2 = res2;

- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (req->ki_list.next) {
+ struct kioctx *ctx = req->ki_ctx;
+ unsigned long flags;

- rcu_read_unlock();
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ list_del(&req->ki_list);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ }
+
+ /*
+ * Special case handling for sync iocbs:
+ * - events go directly into the iocb for fast handling
+ * - the sync task with the iocb in its stack holds the single iocb
+ * ref, no other paths have a way to get another ref
+ * - the sync task helpfully left a reference to itself in the iocb
+ */
+ if (is_sync_kiocb(req)) {
+ BUG_ON(atomic_read(&req->ki_users) != 1);
+ req->ki_user_data = req->ki_res;
+ atomic_set(&req->ki_users, 0);
+ wake_up_process(req->ki_obj.tsk);
+ } else if (batch) {
+ int res;
+ struct kiocb *t;
+ struct rb_node **n = &batch->kiocb.rb_node, *parent = NULL;
+
+ while (*n) {
+ parent = *n;
+ t = container_of(*n, struct kiocb, ki_node);
+
+ res = req->ki_ctx != t->ki_ctx
+ ? req->ki_ctx < t->ki_ctx
+ : req->ki_eventfd != t->ki_eventfd
+ ? req->ki_eventfd < t->ki_eventfd
+ : req < t;
+
+ n = res ? &(*n)->rb_left : &(*n)->rb_right;
+ }
+
+ rb_link_node(&req->ki_node, parent, n);
+ rb_insert_color(&req->ki_node, &batch->kiocb);
+ } else {
+ struct batch_complete batch_stack;
+
+ memset(&req->ki_node, 0, sizeof(req->ki_node));
+ batch_stack.kiocb.rb_node = &req->ki_node;
+
+ batch_complete_aio(&batch_stack);
+ }
}
-EXPORT_SYMBOL(aio_complete);
+EXPORT_SYMBOL(aio_complete_batch);

/* aio_read_events
* Pull an event off of the ioctx's event ring. Returns the number of
diff --git a/fs/bio.c b/fs/bio.c
index b2f9c0d..952efb9 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -27,6 +27,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/cgroup.h>
+#include <linux/aio.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include <trace/events/block.h>
@@ -1409,33 +1410,42 @@ void bio_flush_dcache_pages(struct bio *bi)
EXPORT_SYMBOL(bio_flush_dcache_pages);
#endif

-/**
- * bio_endio - end I/O on a bio
- * @bio: bio
- * @error: error, if any
- *
- * Description:
- * bio_endio() will end I/O on the whole bio. bio_endio() is the
- * preferred way to end I/O on a bio, it takes care of clearing
- * BIO_UPTODATE on error. @error is 0 on success, and and one of the
- * established -Exxxx (-EIO, for instance) error values in case
- * something went wrong. No one should call bi_end_io() directly on a
- * bio unless they own it and thus know that it has an end_io
- * function.
- **/
-void bio_endio(struct bio *bio, int error)
+static inline void __bio_endio(struct bio *bio, struct batch_complete *batch)
{
- if (error)
+ if (bio->bi_error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
- error = -EIO;
+ bio->bi_error = -EIO;
+
+ if (bio->bi_end_io)
+ bio->bi_end_io(bio, bio->bi_error, batch);
+}
+
+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch)
+{
+ if (error)
+ bio->bi_error = error;

trace_block_bio_complete(bio, error);

- if (bio->bi_end_io)
- bio->bi_end_io(bio, error, NULL);
+ if (batch)
+ bio_list_add(&batch->bio, bio);
+ else
+ __bio_endio(bio, batch);
+
+}
+EXPORT_SYMBOL(bio_endio_batch);
+
+void batch_complete(struct batch_complete *batch)
+{
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&batch->bio)))
+ __bio_endio(bio, batch);
+
+ batch_complete_aio(batch);
}
-EXPORT_SYMBOL(bio_endio);
+EXPORT_SYMBOL(batch_complete);

void bio_pair_release(struct bio_pair *bp)
{
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 6ab9b88..bde1ab4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -230,7 +230,8 @@ static inline struct page *dio_get_page(struct dio *dio,
* filesystems can use it to hold additional state between get_block calls and
* dio_complete.
*/
-static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
+static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async,
+ struct batch_complete *batch)
{
ssize_t transferred = 0;

@@ -264,7 +265,7 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
} else {
inode_dio_done(dio->inode);
if (is_async)
- aio_complete(dio->iocb, ret, 0);
+ aio_complete_batch(dio->iocb, ret, 0, batch);
}

return ret;
@@ -274,7 +275,7 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
*/
-static void dio_bio_end_aio(struct bio *bio, int error)
+static void dio_bio_end_aio(struct bio *bio, int error, struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;
unsigned long remaining;
@@ -290,7 +291,7 @@ static void dio_bio_end_aio(struct bio *bio, int error)
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
- dio_complete(dio, dio->iocb->ki_pos, 0, true);
+ dio_complete(dio, dio->iocb->ki_pos, 0, true, batch);
kmem_cache_free(dio_cache, dio);
}
}
@@ -1270,7 +1271,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
dio_await_completion(dio);

if (drop_refcount(dio) == 0) {
- retval = dio_complete(dio, offset, retval, false);
+ retval = dio_complete(dio, offset, retval, false, NULL);
kmem_cache_free(dio_cache, dio);
} else
BUG_ON(retval != -EIOCBQUEUED);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 1bdf965..a7e4c59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -6,11 +6,12 @@
#include <linux/aio_abi.h>
#include <linux/uio.h>
#include <linux/rcupdate.h>
-
#include <linux/atomic.h>
+#include <linux/batch_complete.h>

struct kioctx;
struct kiocb;
+struct batch_complete;

#define KIOCB_KEY 0

@@ -30,6 +31,8 @@ struct kiocb;
typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);

struct kiocb {
+ struct rb_node ki_node;
+
atomic_t ki_users;

struct file *ki_filp;
@@ -43,6 +46,9 @@ struct kiocb {
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
+ long ki_res;
+ long ki_res2;
+
loff_t ki_pos;

void *private;
@@ -85,7 +91,9 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
extern void aio_put_req(struct kiocb *iocb);
-extern void aio_complete(struct kiocb *iocb, long res, long res2);
+extern void batch_complete_aio(struct batch_complete *batch);
+extern void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
@@ -94,7 +102,13 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline void aio_put_req(struct kiocb *iocb) { }
-static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
+
+static inline void batch_complete_aio(struct batch_complete *batch) { }
+static inline void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch)
+{
+ return;
+}
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -104,6 +118,11 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
kiocb_cancel_fn *cancel) { }
#endif /* CONFIG_AIO */

+static inline void aio_complete(struct kiocb *iocb, long res, long res2)
+{
+ aio_complete_batch(iocb, res, res2, NULL);
+}
+
static inline struct kiocb *list_kiocb(struct list_head *h)
{
return list_entry(h, struct kiocb, ki_list);
diff --git a/include/linux/batch_complete.h b/include/linux/batch_complete.h
new file mode 100644
index 0000000..8167a9d
--- /dev/null
+++ b/include/linux/batch_complete.h
@@ -0,0 +1,23 @@
+#ifndef _LINUX_BATCH_COMPLETE_H
+#define _LINUX_BATCH_COMPLETE_H
+
+#include <linux/rbtree.h>
+
+/*
+ * Common stuff to the aio and block code for batch completion. Everything
+ * important is elsewhere:
+ */
+
+struct bio;
+
+struct bio_list {
+ struct bio *head;
+ struct bio *tail;
+};
+
+struct batch_complete {
+ struct bio_list bio;
+ struct rb_root kiocb;
+};
+
+#endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1d077bd..d912a73 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -24,6 +24,7 @@
#include <linux/mempool.h>
#include <linux/ioprio.h>
#include <linux/bug.h>
+#include <linux/batch_complete.h>

#ifdef CONFIG_BLOCK

@@ -68,6 +69,8 @@
#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio) ((bio)->bi_size >> 9)

+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch);
+
static inline unsigned int bio_cur_bytes(struct bio *bio)
{
if (bio->bi_vcnt)
@@ -241,7 +244,25 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)

}

-extern void bio_endio(struct bio *, int);
+/**
+ * bio_endio - end I/O on a bio
+ * @bio: bio
+ * @error: error, if any
+ *
+ * Description:
+ * bio_endio() will end I/O on the whole bio. bio_endio() is the
+ * preferred way to end I/O on a bio, it takes care of clearing
+ * BIO_UPTODATE on error. @error is 0 on success, and and one of the
+ * established -Exxxx (-EIO, for instance) error values in case
+ * something went wrong. No one should call bi_end_io() directly on a
+ * bio unless they own it and thus know that it has an end_io
+ * function.
+ **/
+static inline void bio_endio(struct bio *bio, int error)
+{
+ bio_endio_batch(bio, error, NULL);
+}
+
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);

@@ -420,10 +441,6 @@ static inline bool bio_mergeable(struct bio *bio)
* member of the bio. The bio_list also caches the last list member to allow
* fast access to the tail.
*/
-struct bio_list {
- struct bio *head;
- struct bio *tail;
-};

static inline int bio_list_empty(const struct bio_list *bl)
{
@@ -527,6 +544,15 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}

+static inline void batch_complete_init(struct batch_complete *batch)
+{
+ bio_list_init(&batch->bio);
+ batch->kiocb = RB_ROOT;
+}
+
+void batch_complete(struct batch_complete *batch);
+
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a3f578b..867976c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -43,6 +43,7 @@ struct bio {
* top bits priority
*/

+ short bi_error;
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 78feda9..2f91edb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -877,7 +877,8 @@ extern struct request *blk_fetch_request(struct request_queue *q);
* This prevents code duplication in drivers.
*/
extern bool blk_update_request(struct request *rq, int error,
- unsigned int nr_bytes);
+ unsigned int nr_bytes,
+ struct batch_complete *batch);
extern bool blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
extern void blk_end_request_all(struct request *rq, int error);
@@ -885,10 +886,17 @@ extern bool blk_end_request_cur(struct request *rq, int error);
extern bool blk_end_request_err(struct request *rq, int error);
extern bool __blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
-extern void __blk_end_request_all(struct request *rq, int error);
extern bool __blk_end_request_cur(struct request *rq, int error);
extern bool __blk_end_request_err(struct request *rq, int error);

+extern void blk_end_request_all_batch(struct request *rq, int error,
+ struct batch_complete *batch);
+
+static inline void __blk_end_request_all(struct request *rq, int error)
+{
+ blk_end_request_all_batch(rq, error, NULL);
+}
+
extern void blk_complete_request(struct request *);
extern void __blk_complete_request(struct request *);
extern void blk_abort_request(struct request *);
--
1.8.1.3

2013-03-21 16:37:08

by Kent Overstreet

Subject: [PATCH 31/33] virtio-blk: convert to batch completion

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/virtio_blk.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 8ad21a2..5a9e04a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -210,7 +210,8 @@ static void virtblk_bio_send_flush_work(struct work_struct *work)
virtblk_bio_send_flush(vbr);
}

-static inline void virtblk_request_done(struct virtblk_req *vbr)
+static inline void virtblk_request_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;
struct request *req = vbr->req;
@@ -224,11 +225,12 @@ static inline void virtblk_request_done(struct virtblk_req *vbr)
req->errors = (error != 0);
}

- __blk_end_request_all(req, error);
+ blk_end_request_all_batch(req, error, batch);
mempool_free(vbr, vblk->pool);
}

-static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_flush_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -237,12 +239,13 @@ static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_data_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_data_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -252,17 +255,18 @@ static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_flush_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
if (unlikely(vbr->flags & VBLK_IS_FLUSH))
- virtblk_bio_flush_done(vbr);
+ virtblk_bio_flush_done(vbr, batch);
else
- virtblk_bio_data_done(vbr);
+ virtblk_bio_data_done(vbr, batch);
}

static void virtblk_done(struct virtqueue *vq)
@@ -272,16 +276,19 @@ static void virtblk_done(struct virtqueue *vq)
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
do {
virtqueue_disable_cb(vq);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
if (vbr->bio) {
- virtblk_bio_done(vbr);
+ virtblk_bio_done(vbr, &batch);
bio_done = true;
} else {
- virtblk_request_done(vbr);
+ virtblk_request_done(vbr, &batch);
req_done = true;
}
}
@@ -291,6 +298,8 @@ static void virtblk_done(struct virtqueue *vq)
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);

+ batch_complete(&batch);
+
if (bio_done)
wake_up(&vblk->queue_wait);
}
--
1.8.1.3

2013-03-21 16:37:05

by Kent Overstreet

Subject: [PATCH 32/33] mtip32xx: convert to batch completion

[[email protected]:
* changes for conversion to bio batch completion from Kent
* fix to apply the above changes cleanly on latest mtip32xx code
* batch bio completion changes in
* mtip_command_cleanup()
* mtip_timeout_function()
* mtip_handle_tfe()]

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Asai Thambi S P <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/mtip32xx/mtip32xx.c | 86 ++++++++++++++++++++++-----------------
drivers/block/mtip32xx/mtip32xx.h | 8 ++--
2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 11cc952..b84dda5 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -146,6 +146,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
struct mtip_cmd *command;
struct mtip_port *port = dd->port;
static int in_progress;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (in_progress)
return;
@@ -161,11 +164,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
command = &port->commands[commandindex];

if (atomic_read(&command->active)
- && (command->async_callback)) {
- command->async_callback(command->async_data,
- -ENODEV);
- command->async_callback = NULL;
- command->async_data = NULL;
+ && (command->bio)) {
+ bio_endio_batch(command->bio, -ENODEV, &batch);
+ command->bio = NULL;
}

dma_unmap_sg(&port->dd->pdev->dev,
@@ -173,9 +174,10 @@ static void mtip_command_cleanup(struct driver_data *dd)
command->scatter_ents,
command->direction);
}
+ up(&port->cmd_slot);
}

- up(&port->cmd_slot);
+ batch_complete(&batch);

set_bit(MTIP_DDF_CLEANUP_BIT, &dd->dd_flag);
in_progress = 0;
@@ -564,6 +566,9 @@ static void mtip_timeout_function(unsigned long int data)
unsigned int bit, group;
unsigned int num_command_slots;
unsigned long to, tagaccum[SLOTBITS_IN_LONGS];
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (unlikely(!port))
return;
@@ -606,11 +611,9 @@ static void mtip_timeout_function(unsigned long int data)
writel(1 << bit, port->completed[group]);

/* Call the async completion callback. */
- if (likely(command->async_callback))
- command->async_callback(command->async_data,
- -EIO);
- command->async_callback = NULL;
- command->comp_func = NULL;
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, -EIO, &batch);
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&port->dd->pdev->dev,
@@ -629,6 +632,8 @@ static void mtip_timeout_function(unsigned long int data)
}
}

+ batch_complete(&batch);
+
if (cmdto_cnt) {
print_tags(port->dd, "timed out", tagaccum, cmdto_cnt);
if (!test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags)) {
@@ -679,7 +684,8 @@ static void mtip_timeout_function(unsigned long int data)
static void mtip_async_complete(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
struct mtip_cmd *command;
struct driver_data *dd = data;
@@ -696,11 +702,10 @@ static void mtip_async_complete(struct mtip_port *port,
}

/* Upper layer callback */
- if (likely(command->async_callback))
- command->async_callback(command->async_data, cb_status);
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, cb_status, batch);

- command->async_callback = NULL;
- command->comp_func = NULL;
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&dd->pdev->dev,
@@ -733,24 +738,22 @@ static void mtip_async_complete(struct mtip_port *port,
static void mtip_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
- struct mtip_cmd *command = &port->commands[tag];
struct completion *waiting = data;
if (unlikely(status == PORT_IRQ_TF_ERR))
dev_warn(&port->dd->pdev->dev,
"Internal command %d completed with TFE\n", tag);

- command->async_callback = NULL;
- command->comp_func = NULL;
-
complete(waiting);
}

static void mtip_null_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
return;
}
@@ -779,6 +782,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
unsigned char *buf;
char *fail_reason = NULL;
int fail_all_ncq_write = 0, fail_all_ncq_cmds = 0;
+ struct batch_complete batch;

dev_warn(&dd->pdev->dev, "Taskfile error\n");

@@ -796,13 +800,14 @@ static void mtip_handle_tfe(struct driver_data *dd)
atomic_inc(&cmd->active); /* active > 1 indicates error */
if (cmd->comp_data && cmd->comp_func) {
cmd->comp_func(port, MTIP_TAG_INTERNAL,
- cmd->comp_data, PORT_IRQ_TF_ERR);
+ cmd->comp_data, PORT_IRQ_TF_ERR, NULL);
}
goto handle_tfe_exit;
}

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -829,7 +834,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
cmd->comp_func(port,
tag,
cmd->comp_data,
- 0);
+ 0, &batch);
} else {
dev_err(&port->dd->pdev->dev,
"Missing completion func for tag %d",
@@ -842,6 +847,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
}
}
}
+ batch_complete(&batch);

print_tags(dd, "completed (TFE)", tagaccum, cmd_cnt);

@@ -883,6 +889,7 @@ static void mtip_handle_tfe(struct driver_data *dd)

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -916,7 +923,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
if (cmd->comp_func) {
cmd->comp_func(port, tag,
cmd->comp_data,
- -ENODATA);
+ -ENODATA, &batch);
}
continue;
}
@@ -946,13 +953,15 @@ static void mtip_handle_tfe(struct driver_data *dd)
port,
tag,
cmd->comp_data,
- PORT_IRQ_TF_ERR);
+ PORT_IRQ_TF_ERR, &batch);
else
dev_warn(&port->dd->pdev->dev,
"Bad completion for tag %d\n",
tag);
}
}
+
+ batch_complete(&batch);
print_tags(dd, "reissued (TFE)", tagaccum, cmd_cnt);

handle_tfe_exit:
@@ -973,6 +982,9 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
struct driver_data *dd = port->dd;
int tag, bit;
struct mtip_cmd *command;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (!completed) {
WARN_ON_ONCE(!completed);
@@ -997,7 +1009,8 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
port,
tag,
command->comp_data,
- 0);
+ 0,
+ &batch);
} else {
dev_warn(&dd->pdev->dev,
"Null completion "
@@ -1007,13 +1020,16 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
if (mtip_check_surprise_removal(
dd->pdev)) {
mtip_command_cleanup(dd);
- return;
+ goto out;
}
}
}
completed >>= 1;
}

+out:
+ batch_complete(&batch);
+
/* If last, re-enable interrupts */
if (atomic_dec_return(&dd->irq_workers_active) == 0)
writel(0xffffffff, dd->mmio + HOST_IRQ_STAT);
@@ -1034,7 +1050,7 @@ static inline void mtip_process_legacy(struct driver_data *dd, u32 port_stat)
cmd->comp_func(port,
MTIP_TAG_INTERNAL,
cmd->comp_data,
- 0);
+ 0, NULL);
return;
}
}
@@ -2554,8 +2570,8 @@ static int mtip_hw_ioctl(struct driver_data *dd, unsigned int cmd,
* None
*/
static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
- int nsect, int nents, int tag, void *callback,
- void *data, int dir)
+ int nsect, int nents, int tag,
+ struct bio *bio, int dir)
{
struct host_to_dev_fis *fis;
struct mtip_port *port = dd->port;
@@ -2610,12 +2626,7 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
command->comp_func = mtip_async_complete;
command->direction = dma_dir;

- /*
- * Set the completion function and data for the command passed
- * from the upper layer.
- */
- command->async_data = data;
- command->async_callback = callback;
+ command->bio = bio;

/*
* To prevent this command from being issued
@@ -3795,7 +3806,6 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
bio_sectors(bio),
nents,
tag,
- bio_endio,
bio,
bio_data_dir(bio));
} else
diff --git a/drivers/block/mtip32xx/mtip32xx.h b/drivers/block/mtip32xx/mtip32xx.h
index 3bffff5..af8c6f7 100644
--- a/drivers/block/mtip32xx/mtip32xx.h
+++ b/drivers/block/mtip32xx/mtip32xx.h
@@ -325,11 +325,9 @@ struct mtip_cmd {
void (*comp_func)(struct mtip_port *port,
int tag,
void *data,
- int status);
- /* Additional callback function that may be called by comp_func() */
- void (*async_callback)(void *data, int status);
-
- void *async_data; /* Addl. data passed to async_callback() */
+ int status,
+ struct batch_complete *batch);
+ struct bio *bio;

int scatter_ents; /* Number of scatter list entries used */

--
1.8.1.3

2013-03-21 16:38:13

by Kent Overstreet

Subject: [PATCH 27/33] aio: kill ki_key

ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 7 +++++--
include/linux/aio.h | 9 ++++-----
2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e4b1cc1..8f6fb4d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1257,7 +1257,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
}
}

- ret = put_user(req->ki_key, &user_iocb->aio_key);
+ ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
goto out_put_req;
@@ -1378,10 +1378,13 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,

assert_spin_locked(&ctx->ctx_lock);

+ if (key != KIOCB_KEY)
+ return NULL;
+
/* TODO: use a hash or array, this sucks. */
list_for_each(pos, &ctx->active_reqs) {
struct kiocb *kiocb = list_kiocb(pos);
- if (kiocb->ki_obj.user == iocb && kiocb->ki_key == key)
+ if (kiocb->ki_obj.user == iocb)
return kiocb;
}
return NULL;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index f0a8481..7308836 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -12,7 +12,7 @@
struct kioctx;
struct kiocb;

-#define KIOCB_SYNC_KEY (~0U)
+#define KIOCB_KEY 0

/*
* We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
@@ -56,10 +56,9 @@ typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);
*/
struct kiocb {
atomic_t ki_users;
- unsigned ki_key; /* id of this request */

struct file *ki_filp;
- struct kioctx *ki_ctx; /* may be NULL for sync ops */
+ struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
ssize_t (*ki_retry)(struct kiocb *);
void (*ki_dtor)(struct kiocb *);
@@ -95,14 +94,14 @@ struct kiocb {

static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
- return kiocb->ki_key == KIOCB_SYNC_KEY;
+ return kiocb->ki_ctx == NULL;
}

static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
.ki_users = ATOMIC_INIT(1),
- .ki_key = KIOCB_SYNC_KEY,
+ .ki_ctx = NULL,
.ki_filp = filp,
.ki_obj.tsk = current,
};
--
1.8.1.3

2013-03-21 16:38:15

by Kent Overstreet

Subject: [PATCH 26/33] aio: don't include aio.h in sched.h

Faster kernel compiles by way of fewer unnecessary includes.

[[email protected]: fix fallout]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/s390/hypfs/inode.c | 1 +
block/scsi_ioctl.c | 1 +
drivers/char/mem.c | 1 +
drivers/infiniband/hw/ipath/ipath_file_ops.c | 1 +
drivers/infiniband/hw/qib/qib_file_ops.c | 2 +-
drivers/scsi/sg.c | 1 +
drivers/staging/android/logger.c | 1 +
fs/9p/vfs_addr.c | 1 +
fs/afs/write.c | 1 +
fs/block_dev.c | 1 +
fs/btrfs/file.c | 1 +
fs/btrfs/inode.c | 1 +
fs/ceph/file.c | 1 +
fs/compat.c | 1 +
fs/direct-io.c | 1 +
fs/ecryptfs/file.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext3/inode.c | 1 +
fs/ext4/file.c | 1 +
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 1 +
fs/ext4/page-io.c | 1 +
fs/f2fs/data.c | 1 +
fs/fat/inode.c | 1 +
fs/fuse/cuse.c | 1 +
fs/fuse/dev.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/aops.c | 1 +
fs/gfs2/file.c | 1 +
fs/hfs/inode.c | 1 +
fs/hfsplus/inode.c | 1 +
fs/jfs/inode.c | 1 +
fs/nilfs2/inode.c | 2 +-
fs/ntfs/file.c | 1 +
fs/ntfs/inode.c | 1 +
fs/ocfs2/aops.h | 2 ++
fs/ocfs2/inode.h | 2 ++
fs/pipe.c | 1 +
fs/read_write.c | 1 +
fs/reiserfs/inode.c | 1 +
fs/ubifs/file.c | 1 +
fs/udf/inode.c | 1 +
fs/xfs/xfs_aops.c | 1 +
fs/xfs/xfs_file.c | 1 +
include/linux/cgroup.h | 1 +
include/linux/pid_namespace.h | 1 +
include/linux/sched.h | 2 --
include/linux/writeback.h | 1 +
kernel/fork.c | 1 +
kernel/printk.c | 1 +
kernel/ptrace.c | 1 +
mm/page_io.c | 1 +
mm/shmem.c | 1 +
mm/swap.c | 1 +
security/keys/internal.h | 2 ++
security/keys/keyctl.c | 1 +
sound/core/pcm_native.c | 2 +-
57 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 5f7d7ba..7a539f4 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -21,6 +21,7 @@
#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/mount.h>
+#include <linux/aio.h>
#include <asm/ebcdic.h>
#include "hypfs.h"

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 9a87daa..a5ffcc9 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -27,6 +27,7 @@
#include <linux/ratelimit.h>
#include <linux/slab.h>
#include <linux/times.h>
+#include <linux/uio.h>
#include <asm/uaccess.h>

#include <scsi/scsi.h>
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index e49265f..1ccbe94 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -28,6 +28,7 @@
#include <linux/pfn.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/aio.h>

#include <asm/uaccess.h>

diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index aed8afe..6d7f453 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -40,6 +40,7 @@
#include <linux/slab.h>
#include <linux/highmem.h>
#include <linux/io.h>
+#include <linux/aio.h>
#include <linux/jiffies.h>
#include <linux/cpu.h>
#include <asm/pgtable.h>
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index 4f7aa30..b56c942 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -39,7 +39,7 @@
#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/io.h>
-#include <linux/uio.h>
+#include <linux/aio.h>
#include <linux/jiffies.h>
#include <asm/pgtable.h>
#include <linux/delay.h>
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 9f0c465..df5e961 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -35,6 +35,7 @@ static int sg_version_num = 30534; /* 2 digits for each component */
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/mm.h>
+#include <linux/aio.h>
#include <linux/errno.h>
#include <linux/mtio.h>
#include <linux/ioctl.h>
diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index dbc63cb..b4a313c 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -28,6 +28,7 @@
#include <linux/slab.h>
#include <linux/time.h>
#include <linux/vmalloc.h>
+#include <linux/aio.h>
#include "logger.h"

#include <asm/ioctls.h>
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 0ad61c6..055562c 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -33,6 +33,7 @@
#include <linux/pagemap.h>
#include <linux/idr.h>
#include <linux/sched.h>
+#include <linux/aio.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 7e03ead..a890db4 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -14,6 +14,7 @@
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <linux/pagevec.h>
+#include <linux/aio.h>
#include "internal.h"

static int afs_write_back_from_locked_page(struct afs_writeback *wb,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index aea605c..8fcd407 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -27,6 +27,7 @@
#include <linux/namei.h>
#include <linux/log2.h>
#include <linux/cleancache.h>
+#include <linux/aio.h>
#include <asm/uaccess.h>
#include "internal.h"

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5b4ea5f..85e5576 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -24,6 +24,7 @@
#include <linux/string.h>
#include <linux/backing-dev.h>
#include <linux/mpage.h>
+#include <linux/aio.h>
#include <linux/falloc.h>
#include <linux/swap.h>
#include <linux/writeback.h>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca1b767..ca26188 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -32,6 +32,7 @@
#include <linux/writeback.h>
#include <linux/statfs.h>
#include <linux/compat.h>
+#include <linux/aio.h>
#include <linux/bit_spinlock.h>
#include <linux/xattr.h>
#include <linux/posix_acl.h>
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index bf338d9..eb09f41 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -7,6 +7,7 @@
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/writeback.h>
+#include <linux/aio.h>

#include "super.h"
#include "mds_client.h"
diff --git a/fs/compat.c b/fs/compat.c
index d487985..00e7874 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -48,6 +48,7 @@
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/pagemap.h>
+#include <linux/aio.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f853263..4348b01 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
#include <linux/uio.h>
#include <linux/atomic.h>
#include <linux/prefetch.h>
+#include <linux/aio.h>

/*
* How many user pages to map in one call to get_user_pages(). This determines
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 63b1f54..201f0a0 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -31,6 +31,7 @@
#include <linux/security.h>
#include <linux/compat.h>
#include <linux/fs_stack.h>
+#include <linux/aio.h>
#include "ecryptfs_kernel.h"

/**
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index fe60cc1..0a87bb1 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -31,6 +31,7 @@
#include <linux/mpage.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/aio.h>
#include "ext2.h"
#include "acl.h"
#include "xip.h"
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index d512c4b..eac4f04 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -27,6 +27,7 @@
#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/namei.h>
+#include <linux/aio.h>
#include "ext3.h"
#include "xattr.h"
#include "acl.h"
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 64848b5..4959e29 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -23,6 +23,7 @@
#include <linux/jbd2.h>
#include <linux/mount.h>
#include <linux/path.h>
+#include <linux/aio.h>
#include <linux/quotaops.h>
#include <linux/pagevec.h>
#include "ext4.h"
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index b505a14..21de123 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -20,6 +20,7 @@
* ([email protected]), 1993, 1998
*/

+#include <linux/aio.h>
#include "ext4_jbd2.h"
#include "truncate.h"
#include "ext4_extents.h" /* Needed for EXT_MAX_BLOCKS */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9ea0cde..f513f3d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -37,6 +37,7 @@
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/ratelimit.h>
+#include <linux/aio.h>

#include "ext4_jbd2.h"
#include "xattr.h"
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 809b310..d9903af 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/mpage.h>
#include <linux/namei.h>
+#include <linux/aio.h>
#include <linux/uio.h>
#include <linux/bio.h>
#include <linux/workqueue.h>
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 7bd22a2..d0ed4ba 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -12,6 +12,7 @@
#include <linux/f2fs_fs.h>
#include <linux/buffer_head.h>
#include <linux/mpage.h>
+#include <linux/aio.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/blkdev.h>
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index acf6e47..d1d502a 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -20,6 +20,7 @@
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/mount.h>
+#include <linux/aio.h>
#include <linux/vfs.h>
#include <linux/parser.h>
#include <linux/uio.h>
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..06b5e08 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -38,6 +38,7 @@
#include <linux/device.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/aio.h>
#include <linux/kdev_t.h>
#include <linux/kthread.h>
#include <linux/list.h>
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 11dfa0c..06c569e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -19,6 +19,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/swap.h>
#include <linux/splice.h>
+#include <linux/aio.h>

MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 34b80ba..42b265b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/compat.h>
#include <linux/swap.h>
+#include <linux/aio.h>

static const struct file_operations fuse_direct_io_file_operations;

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 24f414f..371bd14 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -20,6 +20,7 @@
#include <linux/swap.h>
#include <linux/gfs2_ondisk.h>
#include <linux/backing-dev.h>
+#include <linux/aio.h>

#include "gfs2.h"
#include "incore.h"
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 019f45e..1b78c78 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -25,6 +25,7 @@
#include <asm/uaccess.h>
#include <linux/dlm.h>
#include <linux/dlm_plock.h>
+#include <linux/aio.h>

#include "gfs2.h"
#include "incore.h"
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 3031dfd..a9d60d4 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -14,6 +14,7 @@
#include <linux/pagemap.h>
#include <linux/mpage.h>
#include <linux/sched.h>
+#include <linux/aio.h>

#include "hfs_fs.h"
#include "btree.h"
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 160ccc9..cdd181d 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -14,6 +14,7 @@
#include <linux/pagemap.h>
#include <linux/mpage.h>
#include <linux/sched.h>
+#include <linux/aio.h>

#include "hfsplus_fs.h"
#include "hfsplus_raw.h"
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index b7dc47b..1781f06 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -23,6 +23,7 @@
#include <linux/pagemap.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
+#include <linux/aio.h>
#include "jfs_incore.h"
#include "jfs_inode.h"
#include "jfs_filsys.h"
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6b49f14..1e92930 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -25,7 +25,7 @@
#include <linux/gfp.h>
#include <linux/mpage.h>
#include <linux/writeback.h>
-#include <linux/uio.h>
+#include <linux/aio.h>
#include "nilfs.h"
#include "btnode.h"
#include "segment.h"
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 5b2d4f0..600af8f 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>
#include <linux/uio.h>
#include <linux/writeback.h>
+#include <linux/aio.h>

#include <asm/page.h>
#include <asm/uaccess.h>
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index d3e118c..2778b02 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -28,6 +28,7 @@
#include <linux/quotaops.h>
#include <linux/slab.h>
#include <linux/log2.h>
+#include <linux/aio.h>

#include "aops.h"
#include "attrib.h"
diff --git a/fs/ocfs2/aops.h b/fs/ocfs2/aops.h
index ffb2da3..f671e49 100644
--- a/fs/ocfs2/aops.h
+++ b/fs/ocfs2/aops.h
@@ -22,6 +22,8 @@
#ifndef OCFS2_AOPS_H
#define OCFS2_AOPS_H

+#include <linux/aio.h>
+
handle_t *ocfs2_start_walk_page_trans(struct inode *inode,
struct page *page,
unsigned from,
diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
index 88924a3..c765bdf 100644
--- a/fs/ocfs2/inode.h
+++ b/fs/ocfs2/inode.h
@@ -28,6 +28,8 @@

#include "extent_map.h"

+struct iocb;
+
/* OCFS2 Inode Private Data */
struct ocfs2_inode_info
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 2234f3f..34a643d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -21,6 +21,7 @@
#include <linux/audit.h>
#include <linux/syscalls.h>
#include <linux/fcntl.h>
+#include <linux/aio.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
diff --git a/fs/read_write.c b/fs/read_write.c
index 0dabcf7..f2e6108 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -9,6 +9,7 @@
#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/uio.h>
+#include <linux/aio.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include <linux/export.h>
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ea5061f..77d6d47 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -18,6 +18,7 @@
#include <linux/writeback.h>
#include <linux/quotaops.h>
#include <linux/swap.h>
+#include <linux/aio.h>

int reiserfs_commit_write(struct file *f, struct page *page,
unsigned from, unsigned to);
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index f12189d..1437453 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -50,6 +50,7 @@
*/

#include "ubifs.h"
+#include <linux/aio.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/slab.h>
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 7a12e48..b6d15d3 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -38,6 +38,7 @@
#include <linux/slab.h>
#include <linux/crc-itu-t.h>
#include <linux/mpage.h>
+#include <linux/aio.h>

#include "udf_i.h"
#include "udf_sb.h"
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5f707e5..c24ce0e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -31,6 +31,7 @@
#include "xfs_vnodeops.h"
#include "xfs_trace.h"
#include "xfs_bmap.h"
+#include <linux/aio.h>
#include <linux/gfp.h>
#include <linux/mpage.h>
#include <linux/pagevec.h>
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f03bf1a..af5b84f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,7 @@
#include "xfs_ioctl.h"
#include "xfs_trace.h"

+#include <linux/aio.h>
#include <linux/dcache.h>
#include <linux/falloc.h>
#include <linux/pagevec.h>
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 900af59..906e5c5 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -27,6 +27,7 @@ struct cgroup_subsys;
struct inode;
struct cgroup;
struct css_id;
+struct eventfd_ctx;

extern int cgroup_init_early(void);
extern int cgroup_init(void);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 215e5e3..6051177 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -4,6 +4,7 @@
#include <linux/sched.h>
#include <linux/bug.h>
#include <linux/mm.h>
+#include <linux/workqueue.h>
#include <linux/threads.h>
#include <linux/nsproxy.h>
#include <linux/kref.h>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..8d72706 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -325,8 +325,6 @@ extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
struct nsproxy;
struct user_namespace;

-#include <linux/aio.h>
-
#ifdef CONFIG_MMU
extern void arch_pick_mmap_layout(struct mm_struct *mm);
extern unsigned long
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9a9367c..579a500 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -5,6 +5,7 @@
#define WRITEBACK_H

#include <linux/sched.h>
+#include <linux/workqueue.h>
#include <linux/fs.h>

DECLARE_PER_CPU(int, dirty_throttle_leaks);
diff --git a/kernel/fork.c b/kernel/fork.c
index 1766d32..a0aa193 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
#include <linux/khugepaged.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
+#include <linux/aio.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
diff --git a/kernel/printk.c b/kernel/printk.c
index 0b31715..61e2c60 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -32,6 +32,7 @@
#include <linux/security.h>
#include <linux/bootmem.h>
#include <linux/memblock.h>
+#include <linux/aio.h>
#include <linux/syscalls.h>
#include <linux/kexec.h>
#include <linux/kdb.h>
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index acbd284..5d4b4d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -17,6 +17,7 @@
#include <linux/ptrace.h>
#include <linux/security.h>
#include <linux/signal.h>
+#include <linux/uio.h>
#include <linux/audit.h>
#include <linux/pid_namespace.h>
#include <linux/syscalls.h>
diff --git a/mm/page_io.c b/mm/page_io.c
index 78eee32..c535d39 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -20,6 +20,7 @@
#include <linux/buffer_head.h>
#include <linux/writeback.h>
#include <linux/frontswap.h>
+#include <linux/aio.h>
#include <asm/pgtable.h>

static struct bio *get_swap_bio(gfp_t gfp_flags,
diff --git a/mm/shmem.c b/mm/shmem.c
index 1c44af7..96b49bf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
#include <linux/mm.h>
#include <linux/export.h>
#include <linux/swap.h>
+#include <linux/aio.h>

static struct vfsmount *shm_mnt;

diff --git a/mm/swap.c b/mm/swap.c
index 8a529a0..92a9be5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -30,6 +30,7 @@
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
#include <linux/gfp.h>
+#include <linux/uio.h>

#include "internal.h"

diff --git a/security/keys/internal.h b/security/keys/internal.h
index 8bbefc3..d4f1468 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -16,6 +16,8 @@
#include <linux/key-type.h>
#include <linux/task_work.h>

+struct iovec;
+
#ifdef __KDEBUG
#define kenter(FMT, ...) \
printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__)
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 4b5c948..33cfd27 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -22,6 +22,7 @@
#include <linux/err.h>
#include <linux/vmalloc.h>
#include <linux/security.h>
+#include <linux/uio.h>
#include <asm/uaccess.h>
#include "internal.h"

diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index 71ae86c..d470c35 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -25,7 +25,7 @@
#include <linux/slab.h>
#include <linux/time.h>
#include <linux/pm_qos.h>
-#include <linux/uio.h>
+#include <linux/aio.h>
#include <linux/dma-mapping.h>
#include <sound/core.h>
#include <sound/control.h>
--
1.8.1.3

2013-03-21 16:39:27

by Kent Overstreet

Subject: [PATCH 23/33] generic dynamic per cpu refcounting

This implements a refcount with similar semantics to
atomic_inc()/atomic_dec_and_test(): it starts out as just an atomic_t
but dynamically switches to per cpu refcounting when the rate of gets/puts
becomes too high.

It also implements two stage shutdown, as we need it to tear down the
percpu counts. Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate barriers
(synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only returns
true once, so callers don't have to reimplement shutdown synchronization.

For the sake of simplicity/efficiency, the heuristic is pretty simple - it
just switches to percpu refcounting if there are more than x gets in one
second (completely arbitrarily, 4096).

It'd be more correct to count the number of cache misses or something else
more profile driven, but doing so would require accessing the shared ref
twice per get - by just counting the number of gets(), we can stick that
counter in the high bits of the refcount and increment both with a single
atomic64_add(). But I expect this'll be good enough in practice.
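
For a sense of the calling convention, here's a minimal usage sketch against
the API added below - the surrounding struct and helpers (my_obj,
my_obj_create() and friends) are hypothetical placeholders for illustration,
not part of this patch:

#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_obj {
	struct percpu_ref ref;
	/* ... whatever the object actually contains ... */
};

/* creation: the ref starts at 1 (the "initial ref"), in atomic mode */
static struct my_obj *my_obj_create(void)
{
	struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (obj)
		percpu_ref_init(&obj->ref);
	return obj;
}

/* lookup path: gets must happen under rcu_read_lock() */
static void my_obj_get(struct my_obj *obj)
{
	rcu_read_lock();
	percpu_ref_get(&obj->ref);	/* may switch the ref to percpu mode */
	rcu_read_unlock();
}

/* puts only report a count of 0 once percpu_ref_kill() has been called */
static void my_obj_put(struct my_obj *obj)
{
	if (percpu_ref_put(&obj->ref))
		kfree(obj);
}

/* teardown: kill() returns true exactly once, then drop the initial ref */
static void my_obj_destroy(struct my_obj *obj)
{
	if (percpu_ref_kill(&obj->ref)) {
		/* real users also unhook obj from RCU-visible structures here */
		my_obj_put(obj);	/* drops the ref taken in my_obj_create() */
	}
}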

[[email protected]: fix build]
[[email protected]: coding-style tweak]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/percpu-refcount.h | 114 +++++++++++++++++++
lib/Makefile | 2 +-
lib/percpu-refcount.c | 243 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 358 insertions(+), 1 deletion(-)
create mode 100644 include/linux/percpu-refcount.h
create mode 100644 lib/percpu-refcount.c

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 0000000..d0cf887
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,114 @@
+/*
+ * Dynamic percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <[email protected]>
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but potentially percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * BACKGROUND:
+ *
+ * Percpu refcounts are quite useful for performance, but if we blindly
+ * converted all refcounts to percpu counters we'd waste quite a bit of memory.
+ *
+ * Think about all the refcounts embedded in kobjects, files, etc. most of which
+ * aren't used much. These start out as simple atomic counters - a little bigger
+ * than a bare atomic_t, 16 bytes instead of 4 - but if we exceed some arbitrary
+ * number of gets in one second, we then switch to percpu counters.
+ *
+ * This heuristic isn't perfect because it'll fire if the refcount was only
+ * being used on one cpu; ideally we'd be able to count the number of cache
+ * misses on percpu_ref_get() or something similar, but that'd make the non
+ * percpu path significantly heavier/more complex. We can count the number of
+ * gets() without any extra atomic instructions on arches that support
+ * atomic64_t - simply by changing the atomic_inc() to atomic_add_return().
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspace calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
+ * the kioctx from the process's list of kioctxs - after that, there can't be
+ * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
+ * the initial ref with percpu_ref_put().
+ *
+ * Code that does a two stage shutdown like this often needs some kind of
+ * explicit synchronization to ensure the initial refcount can only be dropped
+ * once - percpu_ref_kill() does this for you, it returns true once and false if
+ * someone else already called it. The aio code uses it this way, but it's not
+ * necessary if the code has some other mechanism to synchronize teardown.
+ *
+ * As mentioned previously, we decide when to convert a ref to percpu counters
+ * in percpu_ref_get(). However, allocating the percpu counters requires
+ * dropping rcu_read_lock() briefly, so percpu_ref_get() handles that itself,
+ * releasing and reacquiring rcu_read_lock() around alloc_percpu().
+ *
+ * If there is a codepath where dropping rcu_read_lock() isn't safe - for
+ * example, a spinlock is held - use percpu_ref_get_noalloc() instead; the
+ * allocation is simply skipped there, and will be attempted again the next
+ * time the get counter wraps around, so occasionally taking the noalloc path
+ * doesn't prevent the ref from eventually going percpu.
+ */
+
+#ifndef _LINUX_PERCPU_REFCOUNT_H
+#define _LINUX_PERCPU_REFCOUNT_H
+
+#include <linux/atomic.h>
+#include <linux/percpu.h>
+
+struct percpu_ref {
+ atomic64_t count;
+ unsigned long pcpu_count;
+};
+
+void percpu_ref_init(struct percpu_ref *ref);
+void __percpu_ref_get(struct percpu_ref *ref, bool alloc);
+int percpu_ref_put(struct percpu_ref *ref);
+
+int percpu_ref_kill(struct percpu_ref *ref);
+int percpu_ref_dead(struct percpu_ref *ref);
+
+/**
+ * percpu_ref_get - increment a dynamic percpu refcount
+ *
+ * Increments @ref and possibly converts it to percpu counters. Must be called
+ * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
+ * to allocate percpu counters - if sleeping/allocation isn't safe for some
+ * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().
+ *
+ * Analogous to atomic_inc().
+ */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+ __percpu_ref_get(ref, true);
+}
+
+/**
+ * percpu_ref_get_noalloc - increment a dynamic percpu refcount
+ *
+ * Increments @ref, to be used when it's not safe to allocate percpu counters.
+ * Must be called with rcu_read_lock() held.
+ *
+ * Analogous to atomic_inc().
+ */
+static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
+{
+ __percpu_ref_get(ref, false);
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index d7946ff..32f4455 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o
+ earlycpio.o percpu-refcount.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
new file mode 100644
index 0000000..79c6158
--- /dev/null
+++ b/lib/percpu-refcount.c
@@ -0,0 +1,243 @@
+#define pr_fmt(fmt) "%s: " fmt "\n", __func__
+
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/percpu-refcount.h>
+#include <linux/rcupdate.h>
+
+/*
+ * A percpu refcount can be in 4 different modes. The state is tracked in the
+ * low two bits of percpu_ref->pcpu_count:
+ *
+ * PCPU_REF_NONE - the initial state, no percpu counters allocated.
+ *
+ * PCPU_REF_PTR - using percpu counters for the refcount.
+ *
+ * PCPU_REF_DYING - we're shutting down so get()/put() should use the embedded
+ * atomic counter, but we're not finished updating the atomic counter from the
+ * percpu counters - this means that percpu_ref_put() can't check for the ref
+ * hitting 0 yet.
+ *
+ * PCPU_REF_DEAD - we've finished the teardown sequence, percpu_ref_put() should
+ * now check for the ref hitting 0.
+ *
+ * In PCPU_REF_NONE mode, we need to count the number of times percpu_ref_get()
+ * is called; this is done with the high bits of the raw atomic counter. We also
+ * track the time, in jiffies, when the get count last wrapped - this is done
+ * with the remaining bits of percpu_ref->pcpu_count.
+ *
+ * So, when percpu_ref_get() is called it increments the get count and checks if
+ * it wrapped; if it did, it checks if the last time it wrapped was less than
+ * one second ago; if so, we want to allocate percpu counters.
+ *
+ * PCPU_COUNT_BITS determines the threshold where we convert to percpu: of the
+ * raw 64 bit counter, we use PCPU_COUNT_BITS for the refcount, and the
+ * remaining (high) bits to count the number of times percpu_ref_get() has been
+ * called. It's currently (completely arbitrarily) 16384 times in one second.
+ *
+ * Percpu mode (PCPU_REF_PTR):
+ *
+ * In percpu mode all we do on get and put is increment or decrement the cpu
+ * local counter, which is a 32 bit unsigned int.
+ *
+ * Note that all the gets() could be happening on one cpu, and all the puts() on
+ * another - the individual cpu counters can wrap (potentially many times).
+ *
+ * But this is fine because we don't need to check for the ref hitting 0 in
+ * percpu mode; before we set the state to PCPU_REF_DEAD we simply sum up all
+ * the percpu counters and add them to the atomic counter. Since addition and
+ * subtraction in modular arithmetic are still associative, the result will be
+ * correct.
+ */
+
+#define PCPU_COUNT_BITS 50
+#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1)
+
+#define PCPU_STATUS_BITS 2
+#define PCPU_STATUS_MASK ((1 << PCPU_STATUS_BITS) - 1)
+
+#define PCPU_REF_PTR 0
+#define PCPU_REF_NONE 1
+#define PCPU_REF_DYING 2
+#define PCPU_REF_DEAD 3
+
+#define REF_STATUS(count) (count & PCPU_STATUS_MASK)
+
+/**
+ * percpu_ref_init - initialize a dynamic percpu refcount
+ *
+ * Initializes the refcount in single atomic counter mode with a refcount of 1;
+ * analogous to atomic_set(ref, 1).
+ */
+void percpu_ref_init(struct percpu_ref *ref)
+{
+ unsigned long now = jiffies;
+
+ atomic64_set(&ref->count, 1);
+
+ now <<= PCPU_STATUS_BITS;
+ now |= PCPU_REF_NONE;
+
+ ref->pcpu_count = now;
+}
+
+static void percpu_ref_alloc(struct percpu_ref *ref, unsigned long pcpu_count)
+{
+ unsigned long new, now = jiffies;
+
+ now <<= PCPU_STATUS_BITS;
+ now |= PCPU_REF_NONE;
+
+ if (now - pcpu_count <= HZ << PCPU_STATUS_BITS) {
+ rcu_read_unlock();
+ new = (unsigned long) alloc_percpu(unsigned);
+ rcu_read_lock();
+
+ if (!new)
+ goto update_time;
+
+ BUG_ON(new & PCPU_STATUS_MASK);
+
+ if (cmpxchg(&ref->pcpu_count, pcpu_count, new) != pcpu_count)
+ free_percpu((void __percpu *) new);
+ else
+ pr_debug("created");
+ } else {
+update_time:
+ new = now;
+ cmpxchg(&ref->pcpu_count, pcpu_count, new);
+ }
+}
+
+void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
+{
+ unsigned long pcpu_count;
+ uint64_t v;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
+ /* for rcu - we're not using rcu_dereference() */
+ smp_read_barrier_depends();
+ __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
+ } else {
+ v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS),
+ &ref->count);
+
+ if (!(v >> PCPU_COUNT_BITS) &&
+ REF_STATUS(pcpu_count) == PCPU_REF_NONE && alloc)
+ percpu_ref_alloc(ref, pcpu_count);
+ }
+}
+
+/**
+ * percpu_ref_put - decrement a dynamic percpu refcount
+ *
+ * Returns true if the result is 0, otherwise false; only checks for the ref
+ * hitting 0 after percpu_ref_kill() has been called. Analogous to
+ * atomic_dec_and_test().
+ */
+int percpu_ref_put(struct percpu_ref *ref)
+{
+ unsigned long pcpu_count;
+ uint64_t v;
+ int ret = 0;
+
+ rcu_read_lock();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ switch (REF_STATUS(pcpu_count)) {
+ case PCPU_REF_PTR:
+ /* for rcu - we're not using rcu_dereference() */
+ smp_read_barrier_depends();
+ __this_cpu_dec(*((unsigned __percpu *) pcpu_count));
+ break;
+ case PCPU_REF_NONE:
+ case PCPU_REF_DYING:
+ atomic64_dec(&ref->count);
+ break;
+ case PCPU_REF_DEAD:
+ v = atomic64_dec_return(&ref->count);
+ v &= PCPU_COUNT_MASK;
+
+ ret = v == 0;
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
+ *
+ * Must be called before dropping the initial ref, so that percpu_ref_put()
+ * knows to check for the refcount hitting 0. If the refcount was in percpu
+ * mode, converts it back to single atomic counter mode.
+ *
+ * Returns true the first time called on @ref and false if @ref is already
+ * shutting down, so it may be used by the caller for synchronizing other parts
+ * of a two stage shutdown.
+ */
+int percpu_ref_kill(struct percpu_ref *ref)
+{
+ unsigned long old, new, status, pcpu_count;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ do {
+ status = REF_STATUS(pcpu_count);
+
+ switch (status) {
+ case PCPU_REF_PTR:
+ new = PCPU_REF_DYING;
+ break;
+ case PCPU_REF_NONE:
+ new = PCPU_REF_DEAD;
+ break;
+ case PCPU_REF_DYING:
+ case PCPU_REF_DEAD:
+ return 0;
+ }
+
+ old = pcpu_count;
+ pcpu_count = cmpxchg(&ref->pcpu_count, old, new);
+ } while (pcpu_count != old);
+
+ if (status == PCPU_REF_PTR) {
+ unsigned count = 0, cpu;
+
+ synchronize_rcu();
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
+
+ pr_debug("global %lli pcpu %i",
+ atomic64_read(&ref->count) & PCPU_COUNT_MASK,
+ (int) count);
+
+ atomic64_add((int) count, &ref->count);
+ smp_wmb();
+ /* Between setting global count and setting PCPU_REF_DEAD */
+ ref->pcpu_count = PCPU_REF_DEAD;
+
+ free_percpu((unsigned __percpu *) pcpu_count);
+ }
+
+ return 1;
+}
+
+/**
+ * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
+ *
+ * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
+ */
+int percpu_ref_dead(struct percpu_ref *ref)
+{
+ unsigned status = REF_STATUS(ref->pcpu_count);
+
+ return status == PCPU_REF_DYING ||
+ status == PCPU_REF_DEAD;
+}
--
1.8.1.3

2013-03-21 16:39:41

by Kent Overstreet

Subject: [PATCH 22/33] aio: percpu reqs_available

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before anyway: it was already being rounded up
to the next page, so this doesn't change the actual semantics.
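
As a rough worked example of the sizing logic in ioctx_alloc() below (the CPU
count and nr_events here are assumed for illustration, not taken from the
patch) - with 4 possible CPUs and io_setup(128):

	nr_events = max(128, 4 * 4);	/* = 128; at least 4 slots per possible cpu */
	nr_events *= 2;			/* = 256; up to ~half may sit on other cpus' percpu counters */
	/* aio_setup_ring() then rounds the ring up to page granularity, so ctx->nr_events >= 256 */
	req_batch = (ctx->nr_events - 1) / (4 * 4);
					/* = 15 if ctx->nr_events stays at 256 */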

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 99 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index bc00304..603511d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -26,6 +26,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/mmu_context.h>
+#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/timer.h>
#include <linux/aio.h>
@@ -59,6 +60,10 @@ struct aio_ring {

#define AIO_RING_PAGES 8

+struct kioctx_cpu {
+ unsigned reqs_available;
+};
+
struct kioctx {
atomic_t users;
atomic_t dead;
@@ -67,6 +72,13 @@ struct kioctx {
unsigned long user_id;
struct hlist_node list;

+ struct __percpu kioctx_cpu *cpu;
+
+ /*
+ * For percpu reqs_available, number of slots we move to/from global
+ * counter at a time:
+ */
+ unsigned req_batch;
/*
* This is what userspace passed to io_setup(), it's not used for
* anything but counting against the global max_reqs quota.
@@ -94,6 +106,8 @@ struct kioctx {
* so we avoid overflowing it: it's decremented (if positive)
* when allocating a kiocb and incremented when the resulting
* io_event is pulled off the ringbuffer.
+ *
+ * We batch accesses to it with a percpu version.
*/
atomic_t reqs_available;
} ____cacheline_aligned_in_smp;
@@ -281,6 +295,8 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
static void free_ioctx_rcu(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+
+ free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
}

@@ -294,7 +310,7 @@ static void free_ioctx(struct kioctx *ctx)
struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
- unsigned head, avail;
+ unsigned cpu, head, avail;

spin_lock_irq(&ctx->ctx_lock);

@@ -308,6 +324,13 @@ static void free_ioctx(struct kioctx *ctx)

spin_unlock_irq(&ctx->ctx_lock);

+ for_each_possible_cpu(cpu) {
+ struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
+
+ atomic_add(kcpu->reqs_available, &ctx->reqs_available);
+ kcpu->reqs_available = 0;
+ }
+
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
@@ -358,6 +381,18 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
struct kioctx *ctx;
int err = -ENOMEM;

+ /*
+ * We keep track of the number of available ringbuffer slots, to prevent
+ * overflow (reqs_available), and we also use percpu counters for this.
+ *
+ * So since up to half the slots might be on other cpu's percpu counters
+ * and unavailable, double nr_events so userspace sees what they
+ * expected: additionally, we move req_batch slots to/from percpu
+ * counters at a time, so make sure that isn't 0:
+ */
+ nr_events = max(nr_events, num_possible_cpus() * 4);
+ nr_events *= 2;
+
/* Prevent overflows */
if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
(nr_events > (0x10000000U / sizeof(struct kiocb)))) {
@@ -383,10 +418,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

INIT_LIST_HEAD(&ctx->active_reqs);

- if (aio_setup_ring(ctx) < 0)
+ ctx->cpu = alloc_percpu(struct kioctx_cpu);
+ if (!ctx->cpu)
goto out_freectx;

+ if (aio_setup_ring(ctx) < 0)
+ goto out_freepcpu;
+
atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+ ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
+ BUG_ON(!ctx->req_batch);

/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
@@ -410,6 +451,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
+out_freepcpu:
+ free_percpu(ctx->cpu);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -508,6 +551,52 @@ void exit_aio(struct mm_struct *mm)
}
}

+static void put_reqs_available(struct kioctx *ctx, unsigned nr)
+{
+ struct kioctx_cpu *kcpu;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ kcpu->reqs_available += nr;
+ while (kcpu->reqs_available >= ctx->req_batch * 2) {
+ kcpu->reqs_available -= ctx->req_batch;
+ atomic_add(ctx->req_batch, &ctx->reqs_available);
+ }
+
+ preempt_enable();
+}
+
+static bool get_reqs_available(struct kioctx *ctx)
+{
+ struct kioctx_cpu *kcpu;
+ bool ret = false;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ if (!kcpu->reqs_available) {
+ int old, avail = atomic_read(&ctx->reqs_available);
+
+ do {
+ if (avail < ctx->req_batch)
+ goto out;
+
+ old = avail;
+ avail = atomic_cmpxchg(&ctx->reqs_available,
+ avail, avail - ctx->req_batch);
+ } while (avail != old);
+
+ kcpu->reqs_available += ctx->req_batch;
+ }
+
+ ret = true;
+ kcpu->reqs_available--;
+out:
+ preempt_enable();
+ return ret;
+}
+
/* aio_get_req
* Allocate a slot for an aio request. Increments the ki_users count
* of the kioctx so that the kioctx stays around until all requests are
@@ -522,7 +611,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
+ if (!get_reqs_available(ctx))
return NULL;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
@@ -531,10 +620,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

atomic_set(&req->ki_users, 2);
req->ki_ctx = ctx;
-
return req;
out_put:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
return NULL;
}

@@ -623,6 +711,10 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+ /*
+ * Can't use the percpu reqs_available here - could race with
+ * free_ioctx()
+ */
atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
@@ -760,7 +852,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_add(ret, &ctx->reqs_available);
+ put_reqs_available(ctx, ret);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1193,7 +1285,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;

out_put_req:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.1.3

2013-03-21 16:39:56

by Kent Overstreet

Subject: [PATCH 21/33] aio: reqs_active -> reqs_available

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.
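
As a rough sketch, the slot accounting this patch sets up looks like the
following (paraphrased from the diff below; nr_reaped is just shorthand for
the number of events aio_read_events_ring() copies to userspace):

        /* at ioctx creation: one slot per usable ring entry */
        atomic_set(&ctx->reqs_available, ctx->nr_events - 1);

        /* submit path: claim a slot, or bail if the ring is full */
        if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
                return NULL;

        /* reap path: slots come back when events are pulled off the ring */
        atomic_add(nr_reaped, &ctx->reqs_available);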

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b71691d..bc00304 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -89,7 +89,13 @@ struct kioctx {
struct work_struct rcu_work;

struct {
- atomic_t reqs_active;
+ /*
+ * This counts the number of available slots in the ringbuffer,
+ * so we avoid overflowing it: it's decremented (if positive)
+ * when allocating a kiocb and incremented when the resulting
+ * io_event is pulled off the ringbuffer.
+ */
+ atomic_t reqs_available;
} ____cacheline_aligned_in_smp;

struct {
@@ -306,17 +312,17 @@ static void free_ioctx(struct kioctx *ctx)
head = ring->head;
kunmap_atomic(ring);

- while (atomic_read(&ctx->reqs_active) > 0) {
+ while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
wait_event(ctx->wait, head != ctx->tail);

avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

- atomic_sub(avail, &ctx->reqs_active);
+ atomic_add(avail, &ctx->reqs_available);
head += avail;
head %= ctx->nr_events;
}

- WARN_ON(atomic_read(&ctx->reqs_active) < 0);
+ WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

aio_free_ring(ctx);

@@ -380,6 +386,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (aio_setup_ring(ctx) < 0)
goto out_freectx;

+ atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -482,7 +490,7 @@ void exit_aio(struct mm_struct *mm)
"exit_aio:ioctx still alive: %d %d %d\n",
atomic_read(&ctx->users),
atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_active));
+ atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -514,12 +522,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_read(&ctx->reqs_active) >= ctx->nr_events)
+ if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
return NULL;

- if (atomic_inc_return(&ctx->reqs_active) > ctx->nr_events - 1)
- goto out_put;
-
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
goto out_put;
@@ -529,7 +534,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

return req;
out_put:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
return NULL;
}

@@ -600,7 +605,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

/*
* Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after decrementing reqs_active.
+ * need to issue a wakeup after incrementing reqs_available.
*/
rcu_read_lock();

@@ -618,7 +623,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}
@@ -755,7 +760,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_sub(ret, &ctx->reqs_active);
+ atomic_add(ret, &ctx->reqs_available);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1188,7 +1193,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;

out_put_req:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.1.3

2013-03-21 16:36:31

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 15/33] aio: use flush_dcache_page()

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 45 +++++++++++++++++----------------------------
1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e9511d4..ed9d3a3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -211,33 +211,15 @@ static int aio_setup_ring(struct kioctx *ctx)
ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
ring->header_length = sizeof(struct aio_ring);
kunmap_atomic(ring);
+ flush_dcache_page(info->ring_pages[0]);

return 0;
}

-
-/* aio_ring_event: returns a pointer to the event at the given index from
- * kmap_atomic(). Release the pointer with put_aio_ring_event();
- */
#define AIO_EVENTS_PER_PAGE (PAGE_SIZE / sizeof(struct io_event))
#define AIO_EVENTS_FIRST_PAGE ((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
#define AIO_EVENTS_OFFSET (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)

-#define aio_ring_event(info, nr) ({ \
- unsigned pos = (nr) + AIO_EVENTS_OFFSET; \
- struct io_event *__event; \
- __event = kmap_atomic( \
- (info)->ring_pages[pos / AIO_EVENTS_PER_PAGE]); \
- __event += pos % AIO_EVENTS_PER_PAGE; \
- __event; \
-})
-
-#define put_aio_ring_event(event) do { \
- struct io_event *__event = (event); \
- (void)__event; \
- kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK)); \
-} while(0)
-
static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
{
@@ -648,9 +630,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
struct kioctx *ctx = iocb->ki_ctx;
struct aio_ring_info *info;
struct aio_ring *ring;
- struct io_event *event;
+ struct io_event *ev_page, *event;
unsigned long flags;
- unsigned long tail;
+ unsigned tail, pos;

/*
* Special case handling for sync iocbs:
@@ -689,19 +671,24 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
if (kiocbIsCancelled(iocb))
goto put_rq;

- ring = kmap_atomic(info->ring_pages[0]);
-
tail = info->tail;
- event = aio_ring_event(info, tail);
+ pos = tail + AIO_EVENTS_OFFSET;
+
if (++tail >= info->nr)
tail = 0;

+ ev_page = kmap_atomic(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+ event = ev_page + pos % AIO_EVENTS_PER_PAGE;
+
event->obj = (u64)(unsigned long)iocb->ki_obj.user;
event->data = iocb->ki_user_data;
event->res = res;
event->res2 = res2;

- pr_debug("%p[%lu]: %p: %p %Lx %lx %lx\n",
+ kunmap_atomic(ev_page);
+ flush_dcache_page(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+
+ pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
res, res2);

@@ -711,12 +698,13 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
smp_wmb(); /* make event visible before updating tail */

info->tail = tail;
- ring->tail = tail;

- put_aio_ring_event(event);
+ ring = kmap_atomic(info->ring_pages[0]);
+ ring->tail = tail;
kunmap_atomic(ring);
+ flush_dcache_page(info->ring_pages[0]);

- pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+ pr_debug("added to ring %p at [%u]\n", iocb, tail);

/*
* Check if the user asked us to deliver the result through an
@@ -804,6 +792,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
ring = kmap_atomic(info->ring_pages[0]);
ring->head = head;
kunmap_atomic(ring);
+ flush_dcache_page(info->ring_pages[0]);

pr_debug("%li h%u t%u\n", ret, head, info->tail);
out:
--
1.8.1.3

2013-03-21 16:40:36

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 19/33] aio: kill struct aio_ring_info

struct aio_ring_info was kind of odd; the only place it's used is embedded
in struct kioctx - there's no real need for it.

The next patch rearranges struct kioctx and puts various things on their
own cachelines - getting rid of struct aio_ring_info now makes that
reordering a bit clearer.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 156 ++++++++++++++++++++++++++++++---------------------------------
1 file changed, 75 insertions(+), 81 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 95fcd08..0e283ad 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -58,18 +58,6 @@ struct aio_ring {
}; /* 128 bytes + ring size */

#define AIO_RING_PAGES 8
-struct aio_ring_info {
- unsigned long mmap_base;
- unsigned long mmap_size;
-
- struct page **ring_pages;
- struct mutex ring_lock;
- long nr_pages;
-
- unsigned nr, tail;
-
- struct page *internal_pages[AIO_RING_PAGES];
-};

struct kioctx {
atomic_t users;
@@ -90,14 +78,30 @@ struct kioctx {
* This is what userspace passed to io_setup(), it's not used for
* anything but counting against the global max_reqs quota.
*
- * The real limit is ring->nr - 1, which will be larger (see
+ * The real limit is nr_events - 1, which will be larger (see
* aio_setup_ring())
*/
unsigned max_reqs;

- struct aio_ring_info ring_info;
+ /* Size of ringbuffer, in units of struct io_event */
+ unsigned nr_events;

- spinlock_t completion_lock;
+ unsigned long mmap_base;
+ unsigned long mmap_size;
+
+ struct page **ring_pages;
+ long nr_pages;
+
+ struct {
+ struct mutex ring_lock;
+ } ____cacheline_aligned;
+
+ struct {
+ unsigned tail;
+ spinlock_t completion_lock;
+ } ____cacheline_aligned;
+
+ struct page *internal_pages[AIO_RING_PAGES];

struct rcu_head rcu_head;
struct work_struct rcu_work;
@@ -129,26 +133,21 @@ __initcall(aio_setup);

static void aio_free_ring(struct kioctx *ctx)
{
- struct aio_ring_info *info = &ctx->ring_info;
long i;

- for (i=0; i<info->nr_pages; i++)
- put_page(info->ring_pages[i]);
+ for (i = 0; i < ctx->nr_pages; i++)
+ put_page(ctx->ring_pages[i]);

- if (info->mmap_size) {
- vm_munmap(info->mmap_base, info->mmap_size);
- }
+ if (ctx->mmap_size)
+ vm_munmap(ctx->mmap_base, ctx->mmap_size);

- if (info->ring_pages && info->ring_pages != info->internal_pages)
- kfree(info->ring_pages);
- info->ring_pages = NULL;
- info->nr = 0;
+ if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
+ kfree(ctx->ring_pages);
}

static int aio_setup_ring(struct kioctx *ctx)
{
struct aio_ring *ring;
- struct aio_ring_info *info = &ctx->ring_info;
unsigned nr_events = ctx->max_reqs;
struct mm_struct *mm = current->mm;
unsigned long size, populate;
@@ -166,45 +165,44 @@ static int aio_setup_ring(struct kioctx *ctx)

nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);

- info->nr = 0;
- info->ring_pages = info->internal_pages;
+ ctx->nr_events = 0;
+ ctx->ring_pages = ctx->internal_pages;
if (nr_pages > AIO_RING_PAGES) {
- info->ring_pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
- if (!info->ring_pages)
+ ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *),
+ GFP_KERNEL);
+ if (!ctx->ring_pages)
return -ENOMEM;
}

- info->mmap_size = nr_pages * PAGE_SIZE;
- pr_debug("attempting mmap of %lu bytes\n", info->mmap_size);
+ ctx->mmap_size = nr_pages * PAGE_SIZE;
+ pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
down_write(&mm->mmap_sem);
- info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size,
- PROT_READ|PROT_WRITE,
- MAP_ANONYMOUS|MAP_PRIVATE, 0,
- &populate);
- if (IS_ERR((void *)info->mmap_base)) {
+ ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
+ PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
+ if (IS_ERR((void *)ctx->mmap_base)) {
up_write(&mm->mmap_sem);
- info->mmap_size = 0;
+ ctx->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}

- pr_debug("mmap address: 0x%08lx\n", info->mmap_base);
- info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages,
- 1, 0, info->ring_pages, NULL);
+ pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
+ ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
+ 1, 0, ctx->ring_pages, NULL);
up_write(&mm->mmap_sem);

- if (unlikely(info->nr_pages != nr_pages)) {
+ if (unlikely(ctx->nr_pages != nr_pages)) {
aio_free_ring(ctx);
return -EAGAIN;
}
if (populate)
- mm_populate(info->mmap_base, populate);
+ mm_populate(ctx->mmap_base, populate);

- ctx->user_id = info->mmap_base;
+ ctx->user_id = ctx->mmap_base;
+ ctx->nr_events = nr_events; /* trusted copy */

- info->nr = nr_events; /* trusted copy */
-
- ring = kmap_atomic(info->ring_pages[0]);
+ ring = kmap_atomic(ctx->ring_pages[0]);
ring->nr = nr_events; /* user copy */
ring->id = ctx->user_id;
ring->head = ring->tail = 0;
@@ -213,7 +211,7 @@ static int aio_setup_ring(struct kioctx *ctx)
ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
ring->header_length = sizeof(struct aio_ring);
kunmap_atomic(ring);
- flush_dcache_page(info->ring_pages[0]);
+ flush_dcache_page(ctx->ring_pages[0]);

return 0;
}
@@ -284,7 +282,6 @@ static void free_ioctx_rcu(struct rcu_head *head)
*/
static void free_ioctx(struct kioctx *ctx)
{
- struct aio_ring_info *info = &ctx->ring_info;
struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
@@ -302,18 +299,18 @@ static void free_ioctx(struct kioctx *ctx)

spin_unlock_irq(&ctx->ctx_lock);

- ring = kmap_atomic(info->ring_pages[0]);
+ ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);

while (atomic_read(&ctx->reqs_active) > 0) {
- wait_event(ctx->wait, head != info->tail);
+ wait_event(ctx->wait, head != ctx->tail);

- avail = (head <= info->tail ? info->tail : info->nr) - head;
+ avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

atomic_sub(avail, &ctx->reqs_active);
head += avail;
- head %= info->nr;
+ head %= ctx->nr_events;
}

WARN_ON(atomic_read(&ctx->reqs_active) < 0);
@@ -372,7 +369,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
atomic_set(&ctx->dead, 0);
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
- mutex_init(&ctx->ring_info.ring_lock);
+ mutex_init(&ctx->ring_lock);
init_waitqueue_head(&ctx->wait);

INIT_LIST_HEAD(&ctx->active_reqs);
@@ -396,7 +393,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
spin_unlock(&mm->ioctx_lock);

pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
- ctx, ctx->user_id, mm, ctx->ring_info.nr);
+ ctx, ctx->user_id, mm, ctx->nr_events);
return ctx;

out_cleanup:
@@ -491,7 +488,7 @@ void exit_aio(struct mm_struct *mm)
* just set it to 0; aio_free_ring() is the only
* place that uses ->mmap_size, so it's safe.
*/
- ctx->ring_info.mmap_size = 0;
+ ctx->mmap_size = 0;

if (!atomic_xchg(&ctx->dead, 1)) {
hlist_del_rcu(&ctx->list);
@@ -514,10 +511,10 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_read(&ctx->reqs_active) >= ctx->ring_info.nr)
+ if (atomic_read(&ctx->reqs_active) >= ctx->nr_events)
return NULL;

- if (atomic_inc_return(&ctx->reqs_active) > ctx->ring_info.nr - 1)
+ if (atomic_inc_return(&ctx->reqs_active) > ctx->nr_events - 1)
goto out_put;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
@@ -578,7 +575,6 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
void aio_complete(struct kiocb *iocb, long res, long res2)
{
struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring_info *info;
struct aio_ring *ring;
struct io_event *ev_page, *event;
unsigned long flags;
@@ -599,8 +595,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
return;
}

- info = &ctx->ring_info;
-
/*
* Take rcu_read_lock() in case the kioctx is being destroyed, as we
* need to issue a wakeup after decrementing reqs_active.
@@ -633,13 +627,13 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
spin_lock_irqsave(&ctx->completion_lock, flags);

- tail = info->tail;
+ tail = ctx->tail;
pos = tail + AIO_EVENTS_OFFSET;

- if (++tail >= info->nr)
+ if (++tail >= ctx->nr_events)
tail = 0;

- ev_page = kmap_atomic(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+ ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

event->obj = (u64)(unsigned long)iocb->ki_obj.user;
@@ -648,7 +642,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
event->res2 = res2;

kunmap_atomic(ev_page);
- flush_dcache_page(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+ flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
@@ -659,12 +653,12 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
smp_wmb(); /* make event visible before updating tail */

- info->tail = tail;
+ ctx->tail = tail;

- ring = kmap_atomic(info->ring_pages[0]);
+ ring = kmap_atomic(ctx->ring_pages[0]);
ring->tail = tail;
kunmap_atomic(ring);
- flush_dcache_page(info->ring_pages[0]);
+ flush_dcache_page(ctx->ring_pages[0]);

spin_unlock_irqrestore(&ctx->completion_lock, flags);

@@ -704,29 +698,29 @@ EXPORT_SYMBOL(aio_complete);
static long aio_read_events_ring(struct kioctx *ctx,
struct io_event __user *event, long nr)
{
- struct aio_ring_info *info = &ctx->ring_info;
struct aio_ring *ring;
unsigned head, pos;
long ret = 0;
int copy_ret;

- mutex_lock(&info->ring_lock);
+ mutex_lock(&ctx->ring_lock);

- ring = kmap_atomic(info->ring_pages[0]);
+ ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);

- pr_debug("h%u t%u m%u\n", head, info->tail, info->nr);
+ pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);

- if (head == info->tail)
+ if (head == ctx->tail)
goto out;

while (ret < nr) {
- long avail = (head <= info->tail ? info->tail : info->nr) - head;
+ long avail = (head <= ctx->tail
+ ? ctx->tail : ctx->nr_events) - head;
struct io_event *ev;
struct page *page;

- if (head == info->tail)
+ if (head == ctx->tail)
break;

avail = min(avail, nr - ret);
@@ -734,7 +728,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
((head + AIO_EVENTS_OFFSET) % AIO_EVENTS_PER_PAGE));

pos = head + AIO_EVENTS_OFFSET;
- page = info->ring_pages[pos / AIO_EVENTS_PER_PAGE];
+ page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE];
pos %= AIO_EVENTS_PER_PAGE;

ev = kmap(page);
@@ -748,19 +742,19 @@ static long aio_read_events_ring(struct kioctx *ctx,

ret += avail;
head += avail;
- head %= info->nr;
+ head %= ctx->nr_events;
}

- ring = kmap_atomic(info->ring_pages[0]);
+ ring = kmap_atomic(ctx->ring_pages[0]);
ring->head = head;
kunmap_atomic(ring);
- flush_dcache_page(info->ring_pages[0]);
+ flush_dcache_page(ctx->ring_pages[0]);

- pr_debug("%li h%u t%u\n", ret, head, info->tail);
+ pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

atomic_sub(ret, &ctx->reqs_active);
out:
- mutex_unlock(&info->ring_lock);
+ mutex_unlock(&ctx->ring_lock);

return ret;
}
--
1.8.1.3

2013-03-21 16:41:00

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 17/33] aio: change reqs_active to include unreaped completions

The aio code tries really hard to avoid having to deal with the completion
ringbuffer overflowing. To do that, it has to keep track of the number of
outstanding kiocbs, and the number of completions currently in the
ringbuffer - and it's got to check that every time we allocate a kiocb.
Ouch.

But - we can improve this quite a bit if we just change reqs_active to
mean "number of outstanding requests and unreaped completions" - that
means kiocb allocation doesn't have to look at the ringbuffer, which is a
fairly significant win.
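
Concretely, the allocation-time check no longer needs to kmap the ring and
look at head/tail; a simple bound on the counter is enough (before/after
paraphrased from the diff below, not new code):

        /* before: unreaped completions aren't counted, so peek at the ring */
        avail = aio_ring_avail(&ctx->ring_info, ring) -
                atomic_read(&ctx->reqs_active);

        /* after: reqs_active covers unreaped completions too, so the ring
         * size alone bounds it */
        avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active) - 1;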

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 46 ++++++++++++++++++++++++++++++++--------------
1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 16050fa..6828a31 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -71,12 +71,6 @@ struct aio_ring_info {
struct page *internal_pages[AIO_RING_PAGES];
};

-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
- struct aio_ring *ring)
-{
- return (ring->head + info->nr - 1 - ring->tail) % info->nr;
-}
-
struct kioctx {
atomic_t users;
atomic_t dead;
@@ -92,7 +86,13 @@ struct kioctx {
atomic_t reqs_active;
struct list_head active_reqs; /* used for cancellation */

- /* sys_io_setup currently limits this to an unsigned int */
+ /*
+ * This is what userspace passed to io_setup(), it's not used for
+ * anything but counting against the global max_reqs quota.
+ *
+ * The real limit is ring->nr - 1, which will be larger (see
+ * aio_setup_ring())
+ */
unsigned max_reqs;

struct aio_ring_info ring_info;
@@ -284,8 +284,11 @@ static void free_ioctx_rcu(struct rcu_head *head)
*/
static void free_ioctx(struct kioctx *ctx)
{
+ struct aio_ring_info *info = &ctx->ring_info;
+ struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
+ unsigned head, avail;

spin_lock_irq(&ctx->ctx_lock);

@@ -299,7 +302,21 @@ static void free_ioctx(struct kioctx *ctx)

spin_unlock_irq(&ctx->ctx_lock);

- wait_event(ctx->wait, !atomic_read(&ctx->reqs_active));
+ ring = kmap_atomic(info->ring_pages[0]);
+ head = ring->head;
+ kunmap_atomic(ring);
+
+ while (atomic_read(&ctx->reqs_active) > 0) {
+ wait_event(ctx->wait, head != info->tail);
+
+ avail = (head <= info->tail ? info->tail : info->nr) - head;
+
+ atomic_sub(avail, &ctx->reqs_active);
+ head += avail;
+ head %= info->nr;
+ }
+
+ WARN_ON(atomic_read(&ctx->reqs_active) < 0);

aio_free_ring(ctx);

@@ -548,7 +565,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
unsigned short allocated, to_alloc;
long avail;
struct kiocb *req, *n;
- struct aio_ring *ring;

to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
for (allocated = 0; allocated < to_alloc; allocated++) {
@@ -563,9 +579,8 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
goto out;

spin_lock_irq(&ctx->ctx_lock);
- ring = kmap_atomic(ctx->ring_info.ring_pages[0]);

- avail = aio_ring_avail(&ctx->ring_info, ring) - atomic_read(&ctx->reqs_active);
+ avail = ctx->ring_info.nr - atomic_read(&ctx->reqs_active) - 1;
BUG_ON(avail < 0);
if (avail < allocated) {
/* Trim back the number of requests. */
@@ -580,7 +595,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
batch->count -= allocated;
atomic_add(allocated, &ctx->reqs_active);

- kunmap_atomic(ring);
spin_unlock_irq(&ctx->ctx_lock);

out:
@@ -687,8 +701,11 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
* when the event got cancelled.
*/
if (unlikely(xchg(&iocb->ki_cancel,
- KIOCB_CANCELLED) == KIOCB_CANCELLED))
+ KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+ atomic_dec(&ctx->reqs_active);
+ /* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
+ }

/*
* Add a completion event to the ring buffer. Must be done holding
@@ -745,7 +762,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);
- atomic_dec(&ctx->reqs_active);

/*
* We have to order our ring_info tail store above and test
@@ -822,6 +838,8 @@ static long aio_read_events_ring(struct kioctx *ctx,
flush_dcache_page(info->ring_pages[0]);

pr_debug("%li h%u t%u\n", ret, head, info->tail);
+
+ atomic_sub(ret, &ctx->reqs_active);
out:
mutex_unlock(&info->ring_lock);

--
1.8.1.3

2013-03-21 16:41:29

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 02/33] aio: remove dead code from aio.h

From: Zach Brown <[email protected]>

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/aio.h | 24 ------------------------
1 file changed, 24 deletions(-)

diff --git a/include/linux/aio.h b/include/linux/aio.h
index 31ff6db..b46a09f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -9,44 +9,22 @@

#include <linux/atomic.h>

-#define AIO_MAXSEGS 4
-#define AIO_KIOGRP_NR_ATOMIC 8
-
struct kioctx;

-/* Notes on cancelling a kiocb:
- * If a kiocb is cancelled, aio_complete may return 0 to indicate
- * that cancel has not yet disposed of the kiocb. All cancel
- * operations *must* call aio_put_req to dispose of the kiocb
- * to guard against races with the completion code.
- */
-#define KIOCB_C_CANCELLED 0x01
-#define KIOCB_C_COMPLETE 0x02
-
#define KIOCB_SYNC_KEY (~0U)

/* ki_flags bits */
-/*
- * This may be used for cancel/retry serialization in the future, but
- * for now it's unused and we probably don't want modules to even
- * think they can use it.
- */
-/* #define KIF_LOCKED 0 */
#define KIF_KICKED 1
#define KIF_CANCELLED 2

-#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)

-#define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)

-#define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)

-#define kiocbIsLocked(iocb) test_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)

@@ -207,8 +185,6 @@ struct kioctx {
};

/* prototypes */
-extern unsigned aio_max_size;
-
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
extern int aio_put_req(struct kiocb *iocb);
--
1.8.1.3

2013-03-21 16:41:51

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 16/33] aio: use cancellation list lazily

Cancelling kiocbs requires adding them to a per kioctx linked list, which
is one of the few things we need to take the kioctx lock for in the fast
path. But most kiocbs can't be cancelled - so if we just do this lazily,
we can avoid quite a bit of locking overhead.

While we're at it, instead of using a flag bit, switch to using ki_cancel
itself to indicate that a kiocb has been cancelled/completed. This lets
us get rid of ki_flags entirely.
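
For drivers, the upshot is that a kiocb only lands on the (locked) cancel
list if it opts in. A sketch of the driver side - my_ep_cancel() and
do_hardware_cancel() are made-up names; the real conversion is the gadget
hunk in this patch:

        static int my_ep_cancel(struct kiocb *iocb, struct io_event *e)
        {
                /* tear down the in-flight request; io_cancel() delivers the event */
                return do_hardware_cancel(iocb->private);
        }

        ...
        /* only now is the kiocb added to ctx->active_reqs, under ctx_lock */
        kiocb_set_cancel_fn(iocb, my_ep_cancel);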

[[email protected]: remove buggy BUG()]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/usb/gadget/inode.c | 3 +-
fs/aio.c | 106 ++++++++++++++++++++++++++-------------------
include/linux/aio.h | 27 ++++++++----
3 files changed, 81 insertions(+), 55 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 525cee4..5cc4e7ee 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -533,7 +533,6 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
local_irq_disable();
epdata = priv->epdata;
// spin_lock(&epdata->dev->lock);
- kiocbSetCancelled(iocb);
if (likely(epdata && epdata->ep && priv->req))
value = usb_ep_dequeue (epdata->ep, priv->req);
else
@@ -663,7 +662,7 @@ fail:
goto fail;
}

- iocb->ki_cancel = ep_aio_cancel;
+ kiocb_set_cancel_fn(iocb, ep_aio_cancel);
get_ep(epdata);
priv->epdata = epdata;
priv->actual = 0;
diff --git a/fs/aio.c b/fs/aio.c
index ed9d3a3..16050fa 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -97,6 +97,8 @@ struct kioctx {

struct aio_ring_info ring_info;

+ spinlock_t completion_lock;
+
struct rcu_head rcu_head;
struct work_struct rcu_work;
};
@@ -220,25 +222,51 @@ static int aio_setup_ring(struct kioctx *ctx)
#define AIO_EVENTS_FIRST_PAGE ((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
#define AIO_EVENTS_OFFSET (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)

+void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
+{
+ struct kioctx *ctx = req->ki_ctx;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+
+ if (!req->ki_list.next)
+ list_add(&req->ki_list, &ctx->active_reqs);
+
+ req->ki_cancel = cancel;
+
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+}
+EXPORT_SYMBOL(kiocb_set_cancel_fn);
+
static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
{
- int (*cancel)(struct kiocb *, struct io_event *);
+ kiocb_cancel_fn *old, *cancel;
int ret = -EINVAL;

- cancel = kiocb->ki_cancel;
- kiocbSetCancelled(kiocb);
- if (cancel) {
- atomic_inc(&kiocb->ki_users);
- spin_unlock_irq(&ctx->ctx_lock);
+ /*
+ * Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
+ * actually has a cancel function, hence the cmpxchg()
+ */
+
+ cancel = ACCESS_ONCE(kiocb->ki_cancel);
+ do {
+ if (!cancel || cancel == KIOCB_CANCELLED)
+ return ret;

- memset(res, 0, sizeof(*res));
- res->obj = (u64)(unsigned long)kiocb->ki_obj.user;
- res->data = kiocb->ki_user_data;
- ret = cancel(kiocb, res);
+ old = cancel;
+ cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
+ } while (cancel != old);

- spin_lock_irq(&ctx->ctx_lock);
- }
+ atomic_inc(&kiocb->ki_users);
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ memset(res, 0, sizeof(*res));
+ res->obj = (u64)(unsigned long)kiocb->ki_obj.user;
+ res->data = kiocb->ki_user_data;
+ ret = cancel(kiocb, res);
+
+ spin_lock_irq(&ctx->ctx_lock);

return ret;
}
@@ -326,6 +354,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
atomic_set(&ctx->users, 2);
atomic_set(&ctx->dead, 0);
spin_lock_init(&ctx->ctx_lock);
+ spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(&ctx->wait);

@@ -468,20 +497,12 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
{
struct kiocb *req = NULL;

- req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+ req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
return NULL;

- req->ki_flags = 0;
atomic_set(&req->ki_users, 2);
- req->ki_key = 0;
req->ki_ctx = ctx;
- req->ki_cancel = NULL;
- req->ki_retry = NULL;
- req->ki_dtor = NULL;
- req->private = NULL;
- req->ki_iovec = NULL;
- req->ki_eventfd = NULL;

return req;
}
@@ -512,7 +533,6 @@ static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
spin_lock_irq(&ctx->ctx_lock);
list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
list_del(&req->ki_batch);
- list_del(&req->ki_list);
kmem_cache_free(kiocb_cachep, req);
atomic_dec(&ctx->reqs_active);
}
@@ -558,10 +578,7 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
}

batch->count -= allocated;
- list_for_each_entry(req, &batch->head, ki_batch) {
- list_add(&req->ki_list, &ctx->active_reqs);
- atomic_inc(&ctx->reqs_active);
- }
+ atomic_add(allocated, &ctx->reqs_active);

kunmap_atomic(ring);
spin_unlock_irq(&ctx->ctx_lock);
@@ -652,25 +669,34 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
info = &ctx->ring_info;

/*
- * Add a completion event to the ring buffer. Must be done holding
- * ctx->ctx_lock to prevent other code from messing with the tail
- * pointer since we might be called from irq context.
- *
* Take rcu_read_lock() in case the kioctx is being destroyed, as we
* need to issue a wakeup after decrementing reqs_active.
*/
rcu_read_lock();
- spin_lock_irqsave(&ctx->ctx_lock, flags);

- list_del(&iocb->ki_list); /* remove from active_reqs */
+ if (iocb->ki_list.next) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ list_del(&iocb->ki_list);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ }

/*
* cancelled requests don't get events, userland was given one
* when the event got cancelled.
*/
- if (kiocbIsCancelled(iocb))
+ if (unlikely(xchg(&iocb->ki_cancel,
+ KIOCB_CANCELLED) == KIOCB_CANCELLED))
goto put_rq;

+ /*
+ * Add a completion event to the ring buffer. Must be done holding
+ * ctx->ctx_lock to prevent other code from messing with the tail
+ * pointer since we might be called from irq context.
+ */
+ spin_lock_irqsave(&ctx->completion_lock, flags);
+
tail = info->tail;
pos = tail + AIO_EVENTS_OFFSET;

@@ -704,6 +730,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
kunmap_atomic(ring);
flush_dcache_page(info->ring_pages[0]);

+ spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
pr_debug("added to ring %p at [%u]\n", iocb, tail);

/*
@@ -730,7 +758,6 @@ put_rq:
if (waitqueue_active(&ctx->wait))
wake_up(&ctx->wait);

- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
rcu_read_unlock();
}
EXPORT_SYMBOL(aio_complete);
@@ -1209,15 +1236,10 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_opcode = iocb->aio_lio_opcode;

ret = aio_setup_iocb(req, compat);
-
if (ret)
goto out_put_req;

- if (unlikely(kiocbIsCancelled(req))) {
- ret = -EINTR;
- } else {
- ret = req->ki_retry(req);
- }
+ ret = req->ki_retry(req);
if (ret != -EIOCBQUEUED) {
/*
* There's no easy way to restart the syscall since other AIO's
@@ -1233,10 +1255,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;

out_put_req:
- spin_lock_irq(&ctx->ctx_lock);
- list_del(&req->ki_list);
- spin_unlock_irq(&ctx->ctx_lock);
-
atomic_dec(&ctx->reqs_active);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 1e728f0..d2a0003 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -10,17 +10,24 @@
#include <linux/atomic.h>

struct kioctx;
+struct kiocb;

#define KIOCB_SYNC_KEY (~0U)

-/* ki_flags bits */
-#define KIF_CANCELLED 2
-
-#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
-
-#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+/*
+ * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
+ * cancelled or completed (this makes a certain amount of sense because
+ * successful cancellation - io_cancel() - does deliver the completion to
+ * userspace).
+ *
+ * And since most things don't implement kiocb cancellation and we'd really like
+ * kiocb completion to be lockless when possible, we use ki_cancel to
+ * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
+ * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
+ */
+#define KIOCB_CANCELLED ((void *) (~0ULL))

-#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);

/* is there a better place to document function pointer methods? */
/**
@@ -48,13 +55,12 @@ struct kioctx;
* calls may result in undefined behaviour.
*/
struct kiocb {
- unsigned long ki_flags;
atomic_t ki_users;
unsigned ki_key; /* id of this request */

struct file *ki_filp;
struct kioctx *ki_ctx; /* may be NULL for sync ops */
- int (*ki_cancel)(struct kiocb *, struct io_event *);
+ kiocb_cancel_fn *ki_cancel;
ssize_t (*ki_retry)(struct kiocb *);
void (*ki_dtor)(struct kiocb *);

@@ -112,6 +118,7 @@ struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user *__user *iocbpp, bool compat);
+void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline void aio_put_req(struct kiocb *iocb) { }
@@ -121,6 +128,8 @@ static inline void exit_aio(struct mm_struct *mm) { }
static inline long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user * __user *iocbpp,
bool compat) { return 0; }
+static inline void kiocb_set_cancel_fn(struct kiocb *req,
+ kiocb_cancel_fn *cancel) { }
#endif /* CONFIG_AIO */

static inline struct kiocb *list_kiocb(struct list_head *h)
--
1.8.1.3

2013-03-21 16:42:25

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 14/33] aio: make aio_read_evt() more efficient, convert to hrtimers

Previously, aio_read_evt() pulled a single completion off the ringbuffer
at a time, locking and unlocking each time. Change it to pull off as many
events as it can at a time, and copy them directly to userspace.

This also fixes a bug where if copying the event to userspace failed,
we'd lose the event.

Also convert it to wait_event_interruptible_hrtimeout(), which
simplifies it quite a bit.
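
The resulting read path is short enough to show on its own - the hand-rolled
wait loop, stack timer and all, collapses to (paraphrased from the diff
below):

        wait_event_interruptible_hrtimeout(ctx->wait,
                aio_read_events(ctx, min_nr, nr, event, &ret), until);

        if (!ret && signal_pending(current))
                ret = -EINTR;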

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 238 +++++++++++++++++++++++----------------------------------------
1 file changed, 88 insertions(+), 150 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1e3f72d..e9511d4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -63,7 +63,7 @@ struct aio_ring_info {
unsigned long mmap_size;

struct page **ring_pages;
- spinlock_t ring_lock;
+ struct mutex ring_lock;
long nr_pages;

unsigned nr, tail;
@@ -344,7 +344,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
atomic_set(&ctx->users, 2);
atomic_set(&ctx->dead, 0);
spin_lock_init(&ctx->ctx_lock);
- spin_lock_init(&ctx->ring_info.ring_lock);
+ mutex_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(&ctx->wait);

INIT_LIST_HEAD(&ctx->active_reqs);
@@ -747,187 +747,125 @@ put_rq:
}
EXPORT_SYMBOL(aio_complete);

-/* aio_read_evt
- * Pull an event off of the ioctx's event ring. Returns the number of
- * events fetched (0 or 1 ;-)
- * FIXME: make this use cmpxchg.
- * TODO: make the ringbuffer user mmap()able (requires FIXME).
+/* aio_read_events
+ * Pull an event off of the ioctx's event ring. Returns the number of
+ * events fetched
*/
-static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
+static long aio_read_events_ring(struct kioctx *ctx,
+ struct io_event __user *event, long nr)
{
- struct aio_ring_info *info = &ioctx->ring_info;
+ struct aio_ring_info *info = &ctx->ring_info;
struct aio_ring *ring;
- unsigned long head;
- int ret = 0;
+ unsigned head, pos;
+ long ret = 0;
+ int copy_ret;
+
+ mutex_lock(&info->ring_lock);

ring = kmap_atomic(info->ring_pages[0]);
- pr_debug("h%u t%u m%u\n", ring->head, ring->tail, ring->nr);
+ head = ring->head;
+ kunmap_atomic(ring);
+
+ pr_debug("h%u t%u m%u\n", head, info->tail, info->nr);

- if (ring->head == ring->tail)
+ if (head == info->tail)
goto out;

- spin_lock(&info->ring_lock);
-
- head = ring->head % info->nr;
- if (head != ring->tail) {
- struct io_event *evp = aio_ring_event(info, head);
- *ent = *evp;
- head = (head + 1) % info->nr;
- smp_mb(); /* finish reading the event before updatng the head */
- ring->head = head;
- ret = 1;
- put_aio_ring_event(evp);
+ while (ret < nr) {
+ long avail = (head <= info->tail ? info->tail : info->nr) - head;
+ struct io_event *ev;
+ struct page *page;
+
+ if (head == info->tail)
+ break;
+
+ avail = min(avail, nr - ret);
+ avail = min_t(long, avail, AIO_EVENTS_PER_PAGE -
+ ((head + AIO_EVENTS_OFFSET) % AIO_EVENTS_PER_PAGE));
+
+ pos = head + AIO_EVENTS_OFFSET;
+ page = info->ring_pages[pos / AIO_EVENTS_PER_PAGE];
+ pos %= AIO_EVENTS_PER_PAGE;
+
+ ev = kmap(page);
+ copy_ret = copy_to_user(event + ret, ev + pos, sizeof(*ev) * avail);
+ kunmap(page);
+
+ if (unlikely(copy_ret)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret += avail;
+ head += avail;
+ head %= info->nr;
}
- spin_unlock(&info->ring_lock);

-out:
+ ring = kmap_atomic(info->ring_pages[0]);
+ ring->head = head;
kunmap_atomic(ring);
- pr_debug("%d h%u t%u\n", ret, ring->head, ring->tail);
+
+ pr_debug("%li h%u t%u\n", ret, head, info->tail);
+out:
+ mutex_unlock(&info->ring_lock);
+
return ret;
}

-struct aio_timeout {
- struct timer_list timer;
- int timed_out;
- struct task_struct *p;
-};
-
-static void timeout_func(unsigned long data)
+static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
+ struct io_event __user *event, long *i)
{
- struct aio_timeout *to = (struct aio_timeout *)data;
+ long ret = aio_read_events_ring(ctx, event + *i, nr - *i);

- to->timed_out = 1;
- wake_up_process(to->p);
-}
+ if (ret > 0)
+ *i += ret;

-static inline void init_timeout(struct aio_timeout *to)
-{
- setup_timer_on_stack(&to->timer, timeout_func, (unsigned long) to);
- to->timed_out = 0;
- to->p = current;
-}
+ if (unlikely(atomic_read(&ctx->dead)))
+ ret = -EINVAL;

-static inline void set_timeout(long start_jiffies, struct aio_timeout *to,
- const struct timespec *ts)
-{
- to->timer.expires = start_jiffies + timespec_to_jiffies(ts);
- if (time_after(to->timer.expires, jiffies))
- add_timer(&to->timer);
- else
- to->timed_out = 1;
-}
+ if (!*i)
+ *i = ret;

-static inline void clear_timeout(struct aio_timeout *to)
-{
- del_singleshot_timer_sync(&to->timer);
+ return ret < 0 || *i >= min_nr;
}

-static int read_events(struct kioctx *ctx,
- long min_nr, long nr,
+static long read_events(struct kioctx *ctx, long min_nr, long nr,
struct io_event __user *event,
struct timespec __user *timeout)
{
- long start_jiffies = jiffies;
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
- int ret;
- int i = 0;
- struct io_event ent;
- struct aio_timeout to;
-
- /* needed to zero any padding within an entry (there shouldn't be
- * any, but C is fun!
- */
- memset(&ent, 0, sizeof(ent));
- ret = 0;
- while (likely(i < nr)) {
- ret = aio_read_evt(ctx, &ent);
- if (unlikely(ret <= 0))
- break;
-
- pr_debug("%Lx %Lx %Lx %Lx\n",
- ent.data, ent.obj, ent.res, ent.res2);
-
- /* Could we split the check in two? */
- ret = -EFAULT;
- if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
- pr_debug("lost an event due to EFAULT.\n");
- break;
- }
- ret = 0;
-
- /* Good, event copied to userland, update counts. */
- event ++;
- i ++;
- }
-
- if (min_nr <= i)
- return i;
- if (ret)
- return ret;
-
- /* End fast path */
+ ktime_t until = { .tv64 = KTIME_MAX };
+ long ret = 0;

- init_timeout(&to);
if (timeout) {
struct timespec ts;
- ret = -EFAULT;
+
if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
- goto out;
+ return -EFAULT;

- set_timeout(start_jiffies, &to, &ts);
+ until = timespec_to_ktime(ts);
}

- while (likely(i < nr)) {
- add_wait_queue_exclusive(&ctx->wait, &wait);
- do {
- set_task_state(tsk, TASK_INTERRUPTIBLE);
- ret = aio_read_evt(ctx, &ent);
- if (ret)
- break;
- if (min_nr <= i)
- break;
- if (unlikely(atomic_read(&ctx->dead))) {
- ret = -EINVAL;
- break;
- }
- if (to.timed_out) /* Only check after read evt */
- break;
- /* Try to only show up in io wait if there are ops
- * in flight */
- if (atomic_read(&ctx->reqs_active))
- io_schedule();
- else
- schedule();
- if (signal_pending(tsk)) {
- ret = -EINTR;
- break;
- }
- /*ret = aio_read_evt(ctx, &ent);*/
- } while (1) ;
-
- set_task_state(tsk, TASK_RUNNING);
- remove_wait_queue(&ctx->wait, &wait);
-
- if (unlikely(ret <= 0))
- break;
-
- ret = -EFAULT;
- if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
- pr_debug("lost an event due to EFAULT.\n");
- break;
- }
+ /*
+ * Note that aio_read_events() is being called as the conditional - i.e.
+ * we're calling it after prepare_to_wait() has set task state to
+ * TASK_INTERRUPTIBLE.
+ *
+ * But aio_read_events() can block, and if it blocks it's going to flip
+ * the task state back to TASK_RUNNING.
+ *
+ * This should be ok, provided it doesn't flip the state back to
+ * TASK_RUNNING and return 0 too much - that causes us to spin. That
+ * will only happen if the mutex_lock() call blocks, and we then find
+ * the ringbuffer empty. So in practice we should be ok, but it's
+ * something to be aware of when touching this code.
+ */
+ wait_event_interruptible_hrtimeout(ctx->wait,
+ aio_read_events(ctx, min_nr, nr, event, &ret), until);

- /* Good, event copied to userland, update counts. */
- event ++;
- i ++;
- }
+ if (!ret && signal_pending(current))
+ ret = -EINTR;

- if (timeout)
- clear_timeout(&to);
-out:
- destroy_timer_on_stack(&to.timer);
- return i ? i : ret;
+ return ret;
}

/* sys_io_setup:
--
1.8.1.3

2013-03-21 16:42:45

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 13/33] wait: add wait_event_hrtimeout()

Analogous to wait_event_timeout() and friends, this adds
wait_event_hrtimeout() and wait_event_interruptible_hrtimeout().

Note that unlike the versions that use regular timers, these don't return
the amount of time remaining when they return - instead, they return 0 or
-ETIME if they timed out. I did it this way because I was uncomfortable
with the semantics of returning the time remaining (and wasn't confident I
could get it right, anyway).

If the timer expires, there's no real guarantee that expire_time -
current_time would be <= 0 - due to timer slack certainly, and I'm not
sure I want to know the implications of the different clock bases in
hrtimers.

If the timer does expire and the code calculates that the time remaining
is nonnegative, that could be even worse if the calling code then reuses
that timeout. Probably safer to just return 0 then, but I could imagine
weird bugs or at least unintended behaviour arising from that too.

I came to the conclusion that if other users end up actually needing the
amount of time remaining, the sanest thing to do would be to create a
version that uses absolute timeouts instead of relative.
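
A minimal usage sketch - condition_is_true() is a made-up helper, and wq is
assumed to be a wait_queue_head_t that gets woken when the condition
changes:

        ktime_t until = ktime_set(0, 100 * NSEC_PER_MSEC);      /* 100ms */
        long ret;

        ret = wait_event_interruptible_hrtimeout(wq, condition_is_true(), until);
        if (ret == -ETIME)
                ;       /* timed out; no "time remaining" is reported, by design */
        else if (ret == -ERESTARTSYS)
                ;       /* interrupted by a signal */
        else
                ;       /* ret == 0: the condition became true in time */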

[[email protected]: fix description of `timeout' arg]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/wait.h | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 86 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 7cb64d4..ac38be2 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -330,6 +330,92 @@ do { \
__ret; \
})

+#define __wait_event_hrtimeout(wq, condition, timeout, state) \
+({ \
+ int __ret = 0; \
+ DEFINE_WAIT(__wait); \
+ struct hrtimer_sleeper __t; \
+ \
+ hrtimer_init_on_stack(&__t.timer, CLOCK_MONOTONIC, \
+ HRTIMER_MODE_REL); \
+ hrtimer_init_sleeper(&__t, current); \
+ if ((timeout).tv64 != KTIME_MAX) \
+ hrtimer_start_range_ns(&__t.timer, timeout, \
+ current->timer_slack_ns, \
+ HRTIMER_MODE_REL); \
+ \
+ for (;;) { \
+ prepare_to_wait(&wq, &__wait, state); \
+ if (condition) \
+ break; \
+ if (state == TASK_INTERRUPTIBLE && \
+ signal_pending(current)) { \
+ __ret = -ERESTARTSYS; \
+ break; \
+ } \
+ if (!__t.task) { \
+ __ret = -ETIME; \
+ break; \
+ } \
+ schedule(); \
+ } \
+ \
+ hrtimer_cancel(&__t.timer); \
+ destroy_hrtimer_on_stack(&__t.timer); \
+ finish_wait(&wq, &__wait); \
+ __ret; \
+})
+
+/**
+ * wait_event_hrtimeout - sleep until a condition gets true or a timeout elapses
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @timeout: timeout, as a ktime_t
+ *
+ * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
+ * @condition evaluates to true or a signal is received.
+ * The @condition is checked each time the waitqueue @wq is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function returns 0 if @condition became true, or -ETIME if the timeout
+ * elapsed.
+ */
+#define wait_event_hrtimeout(wq, condition, timeout) \
+({ \
+ int __ret = 0; \
+ if (!(condition)) \
+ __ret = __wait_event_hrtimeout(wq, condition, timeout, \
+ TASK_UNINTERRUPTIBLE); \
+ __ret; \
+})
+
+/**
+ * wait_event_interruptible_hrtimeout - sleep until a condition gets true or a timeout elapses
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @timeout: timeout, as a ktime_t
+ *
+ * The process is put to sleep (TASK_INTERRUPTIBLE) until the
+ * @condition evaluates to true or a signal is received.
+ * The @condition is checked each time the waitqueue @wq is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function returns 0 if @condition became true, -ERESTARTSYS if it was
+ * interrupted by a signal, or -ETIME if the timeout elapsed.
+ */
+#define wait_event_interruptible_hrtimeout(wq, condition, timeout) \
+({ \
+ long __ret = 0; \
+ if (!(condition)) \
+ __ret = __wait_event_hrtimeout(wq, condition, timeout, \
+ TASK_INTERRUPTIBLE); \
+ __ret; \
+})
+
#define __wait_event_interruptible_exclusive(wq, condition, ret) \
do { \
DEFINE_WAIT(__wait); \
--
1.8.1.3

2013-03-21 16:36:21

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 05/33] char: add aio_{read,write} to /dev/{null,zero}

From: Zach Brown <[email protected]>

These are handy for measuring the cost of the aio infrastructure with
operations that do very little and complete immediately.
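
For example, a userspace microbenchmark along these lines (hypothetical and
not part of the patch; needs libaio, link with -laio, error handling elided
for brevity) exercises little more than the aio core itself:

        #include <libaio.h>
        #include <fcntl.h>

        int main(void)
        {
                io_context_t ctx = 0;
                struct iocb cb, *cbs[1] = { &cb };
                struct io_event ev;
                char buf[4096];
                int fd = open("/dev/null", O_WRONLY);
                int i;

                io_setup(64, &ctx);

                for (i = 0; i < 1000000; i++) {
                        /* aio_write_null() completes immediately, in-line */
                        io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);
                        io_submit(ctx, 1, cbs);
                        io_getevents(ctx, 1, 1, &ev, NULL);
                }

                io_destroy(ctx);
                return 0;
        }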

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/char/mem.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 2c644af..e49265f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -627,6 +627,18 @@ static ssize_t write_null(struct file *file, const char __user *buf,
return count;
}

+static ssize_t aio_read_null(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ return 0;
+}
+
+static ssize_t aio_write_null(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ return iov_length(iov, nr_segs);
+}
+
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
struct splice_desc *sd)
{
@@ -670,6 +682,24 @@ static ssize_t read_zero(struct file *file, char __user *buf,
return written ? written : -EFAULT;
}

+static ssize_t aio_read_zero(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ size_t written = 0;
+ unsigned long i;
+ ssize_t ret;
+
+ for (i = 0; i < nr_segs; i++) {
+ ret = read_zero(iocb->ki_filp, iov[i].iov_base, iov[i].iov_len,
+ &pos);
+ if (ret < 0)
+ break;
+ written += ret;
+ }
+
+ return written ? written : -EFAULT;
+}
+
static int mmap_zero(struct file *file, struct vm_area_struct *vma)
{
#ifndef CONFIG_MMU
@@ -738,6 +768,7 @@ static int open_port(struct inode *inode, struct file *filp)
#define full_lseek null_lseek
#define write_zero write_null
#define read_full read_zero
+#define aio_write_zero aio_write_null
#define open_mem open_port
#define open_kmem open_mem
#define open_oldmem open_mem
@@ -766,6 +797,8 @@ static const struct file_operations null_fops = {
.llseek = null_lseek,
.read = read_null,
.write = write_null,
+ .aio_read = aio_read_null,
+ .aio_write = aio_write_null,
.splice_write = splice_write_null,
};

@@ -782,6 +815,8 @@ static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.read = read_zero,
.write = write_zero,
+ .aio_read = aio_read_zero,
+ .aio_write = aio_write_zero,
.mmap = mmap_zero,
};

--
1.8.1.3

2013-03-21 16:43:07

by Felipe Balbi

[permalink] [raw]
Subject: Re: [PATCH 03/33] gadget: remove only user of aio retry

Hi,

On Thu, Mar 21, 2013 at 09:35:24AM -0700, Kent Overstreet wrote:
> From: Zach Brown <[email protected]>
>
> This removes the only in-tree user of aio retry. This will let us remove
> the retry code from the aio core.
>
> Removing retry is relatively easy as the USB gadget wasn't using it to
> retry IOs at all. It always fully submitted the IO in the context of the
> initial io_submit() call. It only used the AIO retry facility to get the
> submitter's mm context for copying the result of a read back to user
> space. This is easy to implement with use_mm() and a work struct, much
> like kvm does with async_pf_execute() for get_user_pages().
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

I don't have any objections to the approach. Let's see if anyone from
linux-usb has anything to say, though.

> drivers/usb/gadget/inode.c | 38 +++++++++++++++++++++++++++++---------
> 1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
> index e2b2e9c..a1aad43 100644
> --- a/drivers/usb/gadget/inode.c
> +++ b/drivers/usb/gadget/inode.c
> @@ -24,6 +24,7 @@
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/poll.h>
> +#include <linux/mmu_context.h>
>
> #include <linux/device.h>
> #include <linux/moduleparam.h>
> @@ -513,6 +514,9 @@ static long ep_ioctl(struct file *fd, unsigned code, unsigned long value)
> struct kiocb_priv {
> struct usb_request *req;
> struct ep_data *epdata;
> + struct kiocb *iocb;
> + struct mm_struct *mm;
> + struct work_struct work;
> void *buf;
> const struct iovec *iv;
> unsigned long nr_segs;
> @@ -540,15 +544,12 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
> return value;
> }
>
> -static ssize_t ep_aio_read_retry(struct kiocb *iocb)
> +static ssize_t ep_copy_to_user(struct kiocb_priv *priv)
> {
> - struct kiocb_priv *priv = iocb->private;
> ssize_t len, total;
> void *to_copy;
> int i;
>
> - /* we "retry" to get the right mm context for this: */
> -
> /* copy stuff into user buffers */
> total = priv->actual;
> len = 0;
> @@ -568,9 +569,26 @@ static ssize_t ep_aio_read_retry(struct kiocb *iocb)
> if (total == 0)
> break;
> }
> +
> + return len;
> +}
> +
> +static void ep_user_copy_worker(struct work_struct *work)
> +{
> + struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
> + struct mm_struct *mm = priv->mm;
> + struct kiocb *iocb = priv->iocb;
> + size_t ret;
> +
> + use_mm(mm);
> + ret = ep_copy_to_user(priv);
> + unuse_mm(mm);
> +
> + /* completing the iocb can drop the ctx and mm, don't touch mm after */
> + aio_complete(iocb, ret, ret);
> +
> kfree(priv->buf);
> kfree(priv);
> - return len;
> }
>
> static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
> @@ -596,14 +614,14 @@ static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
> aio_complete(iocb, req->actual ? req->actual : req->status,
> req->status);
> } else {
> - /* retry() won't report both; so we hide some faults */
> + /* ep_copy_to_user() won't report both; we hide some faults */
> if (unlikely(0 != req->status))
> DBG(epdata->dev, "%s fault %d len %d\n",
> ep->name, req->status, req->actual);
>
> priv->buf = req->buf;
> priv->actual = req->actual;
> - kick_iocb(iocb);
> + schedule_work(&priv->work);
> }
> spin_unlock(&epdata->dev->lock);
>
> @@ -633,8 +651,10 @@ fail:
> return value;
> }
> iocb->private = priv;
> + priv->iocb = iocb;
> priv->iv = iv;
> priv->nr_segs = nr_segs;
> + INIT_WORK(&priv->work, ep_user_copy_worker);
>
> value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
> if (unlikely(value < 0)) {
> @@ -646,6 +666,7 @@ fail:
> get_ep(epdata);
> priv->epdata = epdata;
> priv->actual = 0;
> + priv->mm = current->mm; /* mm teardown waits for iocbs in exit_aio() */
>
> /* each kiocb is coupled to one usb_request, but we can't
> * allocate or submit those if the host disconnected.
> @@ -674,7 +695,7 @@ fail:
> kfree(priv);
> put_ep(epdata);
> } else
> - value = (iv ? -EIOCBRETRY : -EIOCBQUEUED);
> + value = -EIOCBQUEUED;
> return value;
> }
>
> @@ -692,7 +713,6 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
> if (unlikely(!buf))
> return -ENOMEM;
>
> - iocb->ki_retry = ep_aio_read_retry;
> return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
> }
>
> --
> 1.8.1.3
>

--
balbi


2013-03-21 16:43:25

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 07/33] aio: add kiocb_cancel()

Minor refactoring to get rid of some duplicated code.

[[email protected]: fix warning]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 79 +++++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 43 insertions(+), 36 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6b29e41a..d291228 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -220,6 +220,29 @@ static inline void put_ioctx(struct kioctx *kioctx)
__put_ioctx(kioctx);
}

+static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
+ struct io_event *res)
+{
+ int (*cancel)(struct kiocb *, struct io_event *);
+ int ret = -EINVAL;
+
+ cancel = kiocb->ki_cancel;
+ kiocbSetCancelled(kiocb);
+ if (cancel) {
+ kiocb->ki_users++;
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ memset(res, 0, sizeof(*res));
+ res->obj = (u64)(unsigned long)kiocb->ki_obj.user;
+ res->data = kiocb->ki_user_data;
+ ret = cancel(kiocb, res);
+
+ spin_lock_irq(&ctx->ctx_lock);
+ }
+
+ return ret;
+}
+
/* ioctx_alloc
* Allocates and initializes an ioctx. Returns an ERR_PTR if it failed.
*/
@@ -290,25 +313,19 @@ out_freectx:
*/
static void kill_ctx(struct kioctx *ctx)
{
- int (*cancel)(struct kiocb *, struct io_event *);
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
struct io_event res;
+ struct kiocb *req;

spin_lock_irq(&ctx->ctx_lock);
ctx->dead = 1;
while (!list_empty(&ctx->active_reqs)) {
- struct list_head *pos = ctx->active_reqs.next;
- struct kiocb *iocb = list_kiocb(pos);
- list_del_init(&iocb->ki_list);
- cancel = iocb->ki_cancel;
- kiocbSetCancelled(iocb);
- if (cancel) {
- iocb->ki_users++;
- spin_unlock_irq(&ctx->ctx_lock);
- cancel(iocb, &res);
- spin_lock_irq(&ctx->ctx_lock);
- }
+ req = list_first_entry(&ctx->active_reqs,
+ struct kiocb, ki_list);
+
+ list_del_init(&req->ki_list);
+ kiocb_cancel(ctx, req, &res);
}

if (!ctx->reqs_active)
@@ -1411,7 +1428,7 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,
SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
struct io_event __user *, result)
{
- int (*cancel)(struct kiocb *iocb, struct io_event *res);
+ struct io_event res;
struct kioctx *ctx;
struct kiocb *kiocb;
u32 key;
@@ -1426,32 +1443,22 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
return -EINVAL;

spin_lock_irq(&ctx->ctx_lock);
- ret = -EAGAIN;
+
kiocb = lookup_kiocb(ctx, iocb, key);
- if (kiocb && kiocb->ki_cancel) {
- cancel = kiocb->ki_cancel;
- kiocb->ki_users ++;
- kiocbSetCancelled(kiocb);
- } else
- cancel = NULL;
+ if (kiocb)
+ ret = kiocb_cancel(ctx, kiocb, &res);
+ else
+ ret = -EINVAL;
+
spin_unlock_irq(&ctx->ctx_lock);

- if (NULL != cancel) {
- struct io_event tmp;
- pr_debug("calling cancel\n");
- memset(&tmp, 0, sizeof(tmp));
- tmp.obj = (u64)(unsigned long)kiocb->ki_obj.user;
- tmp.data = kiocb->ki_user_data;
- ret = cancel(kiocb, &tmp);
- if (!ret) {
- /* Cancellation succeeded -- copy the result
- * into the user's buffer.
- */
- if (copy_to_user(result, &tmp, sizeof(tmp)))
- ret = -EFAULT;
- }
- } else
- ret = -EINVAL;
+ if (!ret) {
+ /* Cancellation succeeded -- copy the result
+ * into the user's buffer.
+ */
+ if (copy_to_user(result, &res, sizeof(res)))
+ ret = -EFAULT;
+ }

put_ioctx(ctx);

--
1.8.1.3

2013-03-21 16:43:51

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 12/33] aio: refcounting cleanup

The usage of ctx->dead was fubar - it makes no sense to explicitly check
it all over the place, especially when we're already using RCU.

Now, ctx->dead only indicates whether we've dropped the initial
refcount. The new teardown sequence is:
	set ctx->dead
	hlist_del_rcu();
	synchronize_rcu();

Now we know no system calls can take a new ref, and it's safe to drop
the initial ref:
	put_ioctx();

We also need to ensure there are no more outstanding kiocbs. This was
done incorrectly - it was being done in kill_ctx(), and before dropping
the initial refcount. At this point, other syscalls may still be
submitting kiocbs!

Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
kioctx->users has dropped to 0 and we know no more iocbs could be
submitted.
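
In code, the new shutdown ordering is roughly the following - condensed
from kill_ioctx() and free_ioctx() in the diff below (the exit_aio()
path differs only in deferring the final put via call_rcu()):

	static void kill_ioctx(struct kioctx *ctx)
	{
		if (!atomic_xchg(&ctx->dead, 1)) {
			hlist_del_rcu(&ctx->list);
			synchronize_rcu();	/* no new lookups can take a ref */
			wake_up_all(&ctx->wait);
			put_ioctx(ctx);		/* drop the initial ref */
		}
	}

	/* Called from put_ioctx() once the last reference is dropped: */
	static void free_ioctx(struct kioctx *ctx)
	{
		/* cancel everything still on ->active_reqs, then ... */
		wait_event(ctx->wait, !atomic_read(&ctx->reqs_active));
		aio_free_ring(ctx);
		call_rcu(&ctx->rcu_head, free_ioctx_rcu);
	}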

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 272 ++++++++++++++++++++++++++++-----------------------------------
1 file changed, 119 insertions(+), 153 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 3524bb2..1e3f72d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -79,7 +79,7 @@ static inline unsigned aio_ring_avail(struct aio_ring_info *info,

struct kioctx {
atomic_t users;
- int dead;
+ atomic_t dead;

/* This needs improving */
unsigned long user_id;
@@ -98,6 +98,7 @@ struct kioctx {
struct aio_ring_info ring_info;

struct rcu_head rcu_head;
+ struct work_struct rcu_work;
};

/*------ sysctl variables----*/
@@ -237,44 +238,6 @@ static int aio_setup_ring(struct kioctx *ctx)
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK)); \
} while(0)

-static void ctx_rcu_free(struct rcu_head *head)
-{
- struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
- kmem_cache_free(kioctx_cachep, ctx);
-}
-
-/* __put_ioctx
- * Called when the last user of an aio context has gone away,
- * and the struct needs to be freed.
- */
-static void __put_ioctx(struct kioctx *ctx)
-{
- unsigned nr_events = ctx->max_reqs;
- BUG_ON(atomic_read(&ctx->reqs_active));
-
- aio_free_ring(ctx);
- if (nr_events) {
- spin_lock(&aio_nr_lock);
- BUG_ON(aio_nr - nr_events > aio_nr);
- aio_nr -= nr_events;
- spin_unlock(&aio_nr_lock);
- }
- pr_debug("freeing %p\n", ctx);
- call_rcu(&ctx->rcu_head, ctx_rcu_free);
-}
-
-static inline int try_get_ioctx(struct kioctx *kioctx)
-{
- return atomic_inc_not_zero(&kioctx->users);
-}
-
-static inline void put_ioctx(struct kioctx *kioctx)
-{
- BUG_ON(atomic_read(&kioctx->users) <= 0);
- if (unlikely(atomic_dec_and_test(&kioctx->users)))
- __put_ioctx(kioctx);
-}
-
static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
struct io_event *res)
{
@@ -298,6 +261,61 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
return ret;
}

+static void free_ioctx_rcu(struct rcu_head *head)
+{
+ struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+ kmem_cache_free(kioctx_cachep, ctx);
+}
+
+/*
+ * When this function runs, the kioctx has been removed from the "hash table"
+ * and ctx->users has dropped to 0, so we know no more kiocbs can be submitted -
+ * now it's safe to cancel any that need to be.
+ */
+static void free_ioctx(struct kioctx *ctx)
+{
+ struct io_event res;
+ struct kiocb *req;
+
+ spin_lock_irq(&ctx->ctx_lock);
+
+ while (!list_empty(&ctx->active_reqs)) {
+ req = list_first_entry(&ctx->active_reqs,
+ struct kiocb, ki_list);
+
+ list_del_init(&req->ki_list);
+ kiocb_cancel(ctx, req, &res);
+ }
+
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ wait_event(ctx->wait, !atomic_read(&ctx->reqs_active));
+
+ aio_free_ring(ctx);
+
+ spin_lock(&aio_nr_lock);
+ BUG_ON(aio_nr - ctx->max_reqs > aio_nr);
+ aio_nr -= ctx->max_reqs;
+ spin_unlock(&aio_nr_lock);
+
+ pr_debug("freeing %p\n", ctx);
+
+ /*
+ * Here the call_rcu() is between the wait_event() for reqs_active to
+ * hit 0, and freeing the ioctx.
+ *
+ * aio_complete() decrements reqs_active, but it has to touch the ioctx
+ * after to issue a wakeup so we use rcu.
+ */
+ call_rcu(&ctx->rcu_head, free_ioctx_rcu);
+}
+
+static void put_ioctx(struct kioctx *ctx)
+{
+ if (unlikely(atomic_dec_and_test(&ctx->users)))
+ free_ioctx(ctx);
+}
+
/* ioctx_alloc
* Allocates and initializes an ioctx. Returns an ERR_PTR if it failed.
*/
@@ -324,6 +342,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->max_reqs = nr_events;

atomic_set(&ctx->users, 2);
+ atomic_set(&ctx->dead, 0);
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(&ctx->wait);
@@ -361,44 +380,43 @@ out_freectx:
return ERR_PTR(err);
}

-/* kill_ctx
- * Cancels all outstanding aio requests on an aio context. Used
- * when the processes owning a context have all exited to encourage
- * the rapid destruction of the kioctx.
- */
-static void kill_ctx(struct kioctx *ctx)
+static void kill_ioctx_work(struct work_struct *work)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
- struct io_event res;
- struct kiocb *req;
+ struct kioctx *ctx = container_of(work, struct kioctx, rcu_work);

- spin_lock_irq(&ctx->ctx_lock);
- ctx->dead = 1;
- while (!list_empty(&ctx->active_reqs)) {
- req = list_first_entry(&ctx->active_reqs,
- struct kiocb, ki_list);
+ wake_up_all(&ctx->wait);
+ put_ioctx(ctx);
+}

- list_del_init(&req->ki_list);
- kiocb_cancel(ctx, req, &res);
- }
+static void kill_ioctx_rcu(struct rcu_head *head)
+{
+ struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);

- if (!atomic_read(&ctx->reqs_active))
- goto out;
+ INIT_WORK(&ctx->rcu_work, kill_ioctx_work);
+ schedule_work(&ctx->rcu_work);
+}

- add_wait_queue(&ctx->wait, &wait);
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- while (atomic_read(&ctx->reqs_active)) {
- spin_unlock_irq(&ctx->ctx_lock);
- io_schedule();
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- spin_lock_irq(&ctx->ctx_lock);
- }
- __set_task_state(tsk, TASK_RUNNING);
- remove_wait_queue(&ctx->wait, &wait);
+/* kill_ioctx
+ * Cancels all outstanding aio requests on an aio context. Used
+ * when the processes owning a context have all exited to encourage
+ * the rapid destruction of the kioctx.
+ */
+static void kill_ioctx(struct kioctx *ctx)
+{
+ if (!atomic_xchg(&ctx->dead, 1)) {
+ hlist_del_rcu(&ctx->list);
+ /* Between hlist_del_rcu() and dropping the initial ref */
+ synchronize_rcu();

-out:
- spin_unlock_irq(&ctx->ctx_lock);
+ /*
+ * We can't punt to workqueue here because put_ioctx() ->
+ * free_ioctx() will unmap the ringbuffer, and that has to be
+ * done in the original process's context. kill_ioctx_rcu/work()
+ * exist for exit_aio(), as in that path free_ioctx() won't do
+ * the unmap.
+ */
+ kill_ioctx_work(&ctx->rcu_work);
+ }
}

/* wait_on_sync_kiocb:
@@ -417,27 +435,25 @@ ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
}
EXPORT_SYMBOL(wait_on_sync_kiocb);

-/* exit_aio: called when the last user of mm goes away. At this point,
- * there is no way for any new requests to be submited or any of the
- * io_* syscalls to be called on the context. However, there may be
- * outstanding requests which hold references to the context; as they
- * go away, they will call put_ioctx and release any pinned memory
- * associated with the request (held via struct page * references).
+/*
+ * exit_aio: called when the last user of mm goes away. At this point, there is
+ * no way for any new requests to be submited or any of the io_* syscalls to be
+ * called on the context.
+ *
+ * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
+ * them.
*/
void exit_aio(struct mm_struct *mm)
{
struct kioctx *ctx;
+ struct hlist_node *n;

- while (!hlist_empty(&mm->ioctx_list)) {
- ctx = hlist_entry(mm->ioctx_list.first, struct kioctx, list);
- hlist_del_rcu(&ctx->list);
-
- kill_ctx(ctx);
-
+ hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
if (1 != atomic_read(&ctx->users))
printk(KERN_DEBUG
"exit_aio:ioctx still alive: %d %d %d\n",
- atomic_read(&ctx->users), ctx->dead,
+ atomic_read(&ctx->users),
+ atomic_read(&ctx->dead),
atomic_read(&ctx->reqs_active));
/*
* We don't need to bother with munmap() here -
@@ -448,7 +464,11 @@ void exit_aio(struct mm_struct *mm)
* place that uses ->mmap_size, so it's safe.
*/
ctx->ring_info.mmap_size = 0;
- put_ioctx(ctx);
+
+ if (!atomic_xchg(&ctx->dead, 1)) {
+ hlist_del_rcu(&ctx->list);
+ call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
+ }
}
}

@@ -514,8 +534,6 @@ static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
kmem_cache_free(kiocb_cachep, req);
atomic_dec(&ctx->reqs_active);
}
- if (unlikely(!atomic_read(&ctx->reqs_active) && ctx->dead))
- wake_up_all(&ctx->wait);
spin_unlock_irq(&ctx->ctx_lock);
}

@@ -611,13 +629,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
rcu_read_lock();

hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
- /*
- * RCU protects us against accessing freed memory but
- * we have to be careful not to get a reference when the
- * reference count already dropped to 0 (ctx->dead test
- * is unreliable because of races).
- */
- if (ctx->user_id == ctx_id && !ctx->dead && try_get_ioctx(ctx)){
+ if (ctx->user_id == ctx_id){
+ atomic_inc(&ctx->users);
ret = ctx;
break;
}
@@ -656,12 +669,15 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

info = &ctx->ring_info;

- /* add a completion event to the ring buffer.
- * must be done holding ctx->ctx_lock to prevent
- * other code from messing with the tail
- * pointer since we might be called from irq
- * context.
+ /*
+ * Add a completion event to the ring buffer. Must be done holding
+ * ctx->ctx_lock to prevent other code from messing with the tail
+ * pointer since we might be called from irq context.
+ *
+ * Take rcu_read_lock() in case the kioctx is being destroyed, as we
+ * need to issue a wakeup after decrementing reqs_active.
*/
+ rcu_read_lock();
spin_lock_irqsave(&ctx->ctx_lock, flags);

list_del(&iocb->ki_list); /* remove from active_reqs */
@@ -727,6 +743,7 @@ put_rq:
wake_up(&ctx->wait);

spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(aio_complete);

@@ -870,7 +887,7 @@ static int read_events(struct kioctx *ctx,
break;
if (min_nr <= i)
break;
- if (unlikely(ctx->dead)) {
+ if (unlikely(atomic_read(&ctx->dead))) {
ret = -EINVAL;
break;
}
@@ -913,35 +930,6 @@ out:
return i ? i : ret;
}

-/* Take an ioctx and remove it from the list of ioctx's. Protects
- * against races with itself via ->dead.
- */
-static void io_destroy(struct kioctx *ioctx)
-{
- struct mm_struct *mm = current->mm;
- int was_dead;
-
- /* delete the entry from the list is someone else hasn't already */
- spin_lock(&mm->ioctx_lock);
- was_dead = ioctx->dead;
- ioctx->dead = 1;
- hlist_del_rcu(&ioctx->list);
- spin_unlock(&mm->ioctx_lock);
-
- pr_debug("(%p)\n", ioctx);
- if (likely(!was_dead))
- put_ioctx(ioctx); /* twice for the list */
-
- kill_ctx(ioctx);
-
- /*
- * Wake up any waiters. The setting of ctx->dead must be seen
- * by other CPUs at this point. Right now, we rely on the
- * locking done by the above calls to ensure this consistency.
- */
- wake_up_all(&ioctx->wait);
-}
-
/* sys_io_setup:
* Create an aio_context capable of receiving at least nr_events.
* ctxp must not point to an aio_context that already exists, and
@@ -977,7 +965,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
if (!IS_ERR(ioctx)) {
ret = put_user(ioctx->user_id, ctxp);
if (ret)
- io_destroy(ioctx);
+ kill_ioctx(ioctx);
put_ioctx(ioctx);
}

@@ -995,7 +983,7 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
{
struct kioctx *ioctx = lookup_ioctx(ctx);
if (likely(NULL != ioctx)) {
- io_destroy(ioctx);
+ kill_ioctx(ioctx);
put_ioctx(ioctx);
return 0;
}
@@ -1298,25 +1286,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (ret)
goto out_put_req;

- spin_lock_irq(&ctx->ctx_lock);
- /*
- * We could have raced with io_destroy() and are currently holding a
- * reference to ctx which should be destroyed. We cannot submit IO
- * since ctx gets freed as soon as io_submit() puts its reference. The
- * check here is reliable: io_destroy() sets ctx->dead before waiting
- * for outstanding IO and the barrier between these two is realized by
- * unlock of mm->ioctx_lock and lock of ctx->ctx_lock. Analogously we
- * increment ctx->reqs_active before checking for ctx->dead and the
- * barrier is realized by unlock and lock of ctx->ctx_lock. Thus if we
- * don't see ctx->dead set here, io_destroy() waits for our IO to
- * finish.
- */
- if (ctx->dead)
- ret = -EINVAL;
- spin_unlock_irq(&ctx->ctx_lock);
- if (ret)
- goto out_put_req;
-
if (unlikely(kiocbIsCancelled(req))) {
ret = -EINTR;
} else {
@@ -1342,9 +1311,6 @@ out_put_req:
spin_unlock_irq(&ctx->ctx_lock);

atomic_dec(&ctx->reqs_active);
- if (unlikely(!atomic_read(&ctx->reqs_active) && ctx->dead))
- wake_up_all(&ctx->wait);
-
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.1.3

2013-03-21 16:44:07

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 28/33] aio: kill ki_retry

Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
need this anymore - ki_retry was only called right after the kiocb was
initialized.

This also refactors and trims some duplicated code, as well as cleaning up
the refcounting/error handling a bit.
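
With retry gone, submission just calls the file's aio op directly and
completes the iocb on the spot unless the op queued it. Trimming the
iovec setup and error handling, the flow in the new aio_run_iocb()
below is roughly:

	aio_rw_op *rw_op = (rw == READ) ? file->f_op->aio_read
					: file->f_op->aio_write;

	ret = aio_rw_vect_retry(req, rw, rw_op);	/* calls rw_op() */
	if (ret != -EIOCBQUEUED)
		aio_complete(req, ret, 0);	/* finished (or failed) synchronously */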

[[email protected]: use fmode_t in aio_run_iocb()]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 218 ++++++++++++++++++++--------------------------------
include/linux/aio.h | 26 -------
2 files changed, 82 insertions(+), 162 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8f6fb4d..ba23c03 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1005,24 +1005,15 @@ static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
BUG_ON(ret > 0 && iocb->ki_left == 0);
}

-static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
+typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
+ unsigned long, loff_t);
+
+static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- ssize_t (*rw_op)(struct kiocb *, const struct iovec *,
- unsigned long, loff_t);
ssize_t ret = 0;
- unsigned short opcode;
-
- if ((iocb->ki_opcode == IOCB_CMD_PREADV) ||
- (iocb->ki_opcode == IOCB_CMD_PREAD)) {
- rw_op = file->f_op->aio_read;
- opcode = IOCB_CMD_PREADV;
- } else {
- rw_op = file->f_op->aio_write;
- opcode = IOCB_CMD_PWRITEV;
- }

/* This matches the pread()/pwrite() logic */
if (iocb->ki_pos < 0)
@@ -1038,7 +1029,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
/* retry all partial writes. retry partial reads as long as its a
* regular file. */
} while (ret > 0 && iocb->ki_left > 0 &&
- (opcode == IOCB_CMD_PWRITEV ||
+ (rw == WRITE ||
(!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))));

/* This means we must have transferred all that we could */
@@ -1048,7 +1039,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)

/* If we managed to write some out we return that, rather than
* the eventual error. */
- if (opcode == IOCB_CMD_PWRITEV
+ if (rw == WRITE
&& ret < 0 && ret != -EIOCBQUEUED
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;
@@ -1056,73 +1047,41 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
return ret;
}

-static ssize_t aio_fdsync(struct kiocb *iocb)
-{
- struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
-
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 1);
- return ret;
-}
-
-static ssize_t aio_fsync(struct kiocb *iocb)
-{
- struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
-
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 0);
- return ret;
-}
-
-static ssize_t aio_setup_vectored_rw(int type, struct kiocb *kiocb, bool compat)
+static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
{
ssize_t ret;

+ kiocb->ki_nr_segs = kiocb->ki_nbytes;
+
#ifdef CONFIG_COMPAT
if (compat)
- ret = compat_rw_copy_check_uvector(type,
+ ret = compat_rw_copy_check_uvector(rw,
(struct compat_iovec __user *)kiocb->ki_buf,
- kiocb->ki_nbytes, 1, &kiocb->ki_inline_vec,
+ kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
&kiocb->ki_iovec);
else
#endif
- ret = rw_copy_check_uvector(type,
+ ret = rw_copy_check_uvector(rw,
(struct iovec __user *)kiocb->ki_buf,
- kiocb->ki_nbytes, 1, &kiocb->ki_inline_vec,
+ kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
&kiocb->ki_iovec);
if (ret < 0)
- goto out;
-
- ret = rw_verify_area(type, kiocb->ki_filp, &kiocb->ki_pos, ret);
- if (ret < 0)
- goto out;
+ return ret;

- kiocb->ki_nr_segs = kiocb->ki_nbytes;
- kiocb->ki_cur_seg = 0;
- /* ki_nbytes/left now reflect bytes instead of segs */
+ /* ki_nbytes now reflect bytes instead of segs */
kiocb->ki_nbytes = ret;
- kiocb->ki_left = ret;
-
- ret = 0;
-out:
- return ret;
+ return 0;
}

-static ssize_t aio_setup_single_vector(int type, struct file * file, struct kiocb *kiocb)
+static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
{
- int bytes;
-
- bytes = rw_verify_area(type, file, &kiocb->ki_pos, kiocb->ki_left);
- if (bytes < 0)
- return bytes;
+ if (unlikely(!access_ok(!rw, kiocb->ki_buf, kiocb->ki_nbytes)))
+ return -EFAULT;

kiocb->ki_iovec = &kiocb->ki_inline_vec;
kiocb->ki_iovec->iov_base = kiocb->ki_buf;
- kiocb->ki_iovec->iov_len = bytes;
+ kiocb->ki_iovec->iov_len = kiocb->ki_nbytes;
kiocb->ki_nr_segs = 1;
- kiocb->ki_cur_seg = 0;
return 0;
}

@@ -1131,81 +1090,81 @@ static ssize_t aio_setup_single_vector(int type, struct file * file, struct kioc
* Performs the initial checks and aio retry method
* setup for the kiocb at the time of io submission.
*/
-static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
+static ssize_t aio_run_iocb(struct kiocb *req, bool compat)
{
- struct file *file = kiocb->ki_filp;
- ssize_t ret = 0;
+ struct file *file = req->ki_filp;
+ ssize_t ret;
+ int rw;
+ fmode_t mode;
+ aio_rw_op *rw_op;

- switch (kiocb->ki_opcode) {
+ switch (req->ki_opcode) {
case IOCB_CMD_PREAD:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- break;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_WRITE, kiocb->ki_buf,
- kiocb->ki_left)))
- break;
- ret = aio_setup_single_vector(READ, file, kiocb);
- if (ret)
- break;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- kiocb->ki_retry = aio_rw_vect_retry;
- break;
- case IOCB_CMD_PWRITE:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- break;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_READ, kiocb->ki_buf,
- kiocb->ki_left)))
- break;
- ret = aio_setup_single_vector(WRITE, file, kiocb);
- if (ret)
- break;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- kiocb->ki_retry = aio_rw_vect_retry;
- break;
case IOCB_CMD_PREADV:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- break;
- ret = aio_setup_vectored_rw(READ, kiocb, compat);
- if (ret)
- break;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- kiocb->ki_retry = aio_rw_vect_retry;
- break;
+ mode = FMODE_READ;
+ rw = READ;
+ rw_op = file->f_op->aio_read;
+ goto rw_common;
+
+ case IOCB_CMD_PWRITE:
case IOCB_CMD_PWRITEV:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- break;
- ret = aio_setup_vectored_rw(WRITE, kiocb, compat);
+ mode = FMODE_WRITE;
+ rw = WRITE;
+ rw_op = file->f_op->aio_write;
+ goto rw_common;
+rw_common:
+ if (unlikely(!(file->f_mode & mode)))
+ return -EBADF;
+
+ if (!rw_op)
+ return -EINVAL;
+
+ ret = (req->ki_opcode == IOCB_CMD_PREADV ||
+ req->ki_opcode == IOCB_CMD_PWRITEV)
+ ? aio_setup_vectored_rw(rw, req, compat)
+ : aio_setup_single_vector(rw, req);
if (ret)
- break;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- kiocb->ki_retry = aio_rw_vect_retry;
+ return ret;
+
+ ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
+ if (ret < 0)
+ return ret;
+
+ req->ki_nbytes = ret;
+ req->ki_left = ret;
+
+ ret = aio_rw_vect_retry(req, rw, rw_op);
break;
+
case IOCB_CMD_FDSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- kiocb->ki_retry = aio_fdsync;
+ if (!file->f_op->aio_fsync)
+ return -EINVAL;
+
+ ret = file->f_op->aio_fsync(req, 1);
break;
+
case IOCB_CMD_FSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- kiocb->ki_retry = aio_fsync;
+ if (!file->f_op->aio_fsync)
+ return -EINVAL;
+
+ ret = file->f_op->aio_fsync(req, 0);
break;
+
default:
pr_debug("EINVAL: no operation provided\n");
- ret = -EINVAL;
+ return -EINVAL;
}

- if (!kiocb->ki_retry)
- return ret;
+ if (ret != -EIOCBQUEUED) {
+ /*
+ * There's no easy way to restart the syscall since other AIO's
+ * may be already running. Just fail this IO with EINTR.
+ */
+ if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+ ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
+ ret = -EINTR;
+ aio_complete(req, ret, 0);
+ }

return 0;
}
@@ -1232,7 +1191,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return -EINVAL;
}

- req = aio_get_req(ctx); /* returns with 2 references to req */
+ req = aio_get_req(ctx);
if (unlikely(!req))
return -EAGAIN;

@@ -1271,25 +1230,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;

- ret = aio_setup_iocb(req, compat);
+ ret = aio_run_iocb(req, compat);
if (ret)
goto out_put_req;

- ret = req->ki_retry(req);
- if (ret != -EIOCBQUEUED) {
- /*
- * There's no easy way to restart the syscall since other AIO's
- * may be already running. Just fail this IO with EINTR.
- */
- if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
- ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
- ret = -EINTR;
- aio_complete(req, ret, 0);
- }
-
aio_put_req(req); /* drop extra ref to req */
return 0;
-
out_put_req:
put_reqs_available(ctx, 1);
aio_put_req(req); /* drop extra ref to req */
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7308836..1bdf965 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -29,38 +29,12 @@ struct kiocb;

typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);

-/* is there a better place to document function pointer methods? */
-/**
- * ki_retry - iocb forward progress callback
- * @kiocb: The kiocb struct to advance by performing an operation.
- *
- * This callback is called when the AIO core wants a given AIO operation
- * to make forward progress. The kiocb argument describes the operation
- * that is to be performed. As the operation proceeds, perhaps partially,
- * ki_retry is expected to update the kiocb with progress made. Typically
- * ki_retry is set in the AIO core and it itself calls file_operations
- * helpers.
- *
- * ki_retry's return value determines when the AIO operation is completed
- * and an event is generated in the AIO event ring. Except the special
- * return values described below, the value that is returned from ki_retry
- * is transferred directly into the completion ring as the operation's
- * resulting status. Once this has happened ki_retry *MUST NOT* reference
- * the kiocb pointer again.
- *
- * If ki_retry returns -EIOCBQUEUED it has made a promise that aio_complete()
- * will be called on the kiocb pointer in the future. The AIO core will
- * not ask the method again -- ki_retry must ensure forward progress.
- * aio_complete() must be called once and only once in the future, multiple
- * calls may result in undefined behaviour.
- */
struct kiocb {
atomic_t ki_users;

struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
- ssize_t (*ki_retry)(struct kiocb *);
void (*ki_dtor)(struct kiocb *);

union {
--
1.8.1.3

2013-03-21 16:44:28

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 24/33] aio: percpu ioctx refcount

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.
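
The conversion is mostly mechanical. As a sketch, the old atomic
users/dead pair maps onto the percpu-refcount primitives this series
depends on (where percpu_ref_put() and percpu_ref_kill() return true on
the final put and the first kill, respectively) like so:

	percpu_ref_init(&ctx->users);	/* starts with one ref; replaces atomic_set(&ctx->users, 2) */

	rcu_read_lock();
	percpu_ref_get(&ctx->users);	/* the extra ref ioctx_alloc() hands back */
	rcu_read_unlock();

	if (percpu_ref_put(&ctx->users))	/* was atomic_dec_and_test(&ctx->users) */
		free_ioctx(ctx);

	if (percpu_ref_kill(&ctx->users))	/* was !atomic_xchg(&ctx->dead, 1) */
		/* proceed with teardown */;

	if (unlikely(percpu_ref_dead(&ctx->users)))	/* was atomic_read(&ctx->dead) */
		ret = -EINVAL;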

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 603511d..3db2dab 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -36,6 +36,7 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
+#include <linux/percpu-refcount.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -65,8 +66,7 @@ struct kioctx_cpu {
};

struct kioctx {
- atomic_t users;
- atomic_t dead;
+ struct percpu_ref users;

/* This needs improving */
unsigned long user_id;
@@ -368,7 +368,7 @@ static void free_ioctx(struct kioctx *ctx)

static void put_ioctx(struct kioctx *ctx)
{
- if (unlikely(atomic_dec_and_test(&ctx->users)))
+ if (percpu_ref_put(&ctx->users))
free_ioctx(ctx);
}

@@ -409,8 +409,11 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

ctx->max_reqs = nr_events;

- atomic_set(&ctx->users, 2);
- atomic_set(&ctx->dead, 0);
+ percpu_ref_init(&ctx->users);
+ rcu_read_lock();
+ percpu_ref_get(&ctx->users);
+ rcu_read_unlock();
+
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
@@ -482,7 +485,7 @@ static void kill_ioctx_rcu(struct rcu_head *head)
*/
static void kill_ioctx(struct kioctx *ctx)
{
- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
/* Between hlist_del_rcu() and dropping the initial ref */
synchronize_rcu();
@@ -528,12 +531,6 @@ void exit_aio(struct mm_struct *mm)
struct hlist_node *n;

hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
- if (1 != atomic_read(&ctx->users))
- printk(KERN_DEBUG
- "exit_aio:ioctx still alive: %d %d %d\n",
- atomic_read(&ctx->users),
- atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -544,7 +541,7 @@ void exit_aio(struct mm_struct *mm)
*/
ctx->mmap_size = 0;

- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
}
@@ -655,7 +652,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
if (ctx->user_id == ctx_id){
- atomic_inc(&ctx->users);
+ percpu_ref_get(&ctx->users);
ret = ctx;
break;
}
@@ -867,7 +864,7 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
if (ret > 0)
*i += ret;

- if (unlikely(atomic_read(&ctx->dead)))
+ if (unlikely(percpu_ref_dead(&ctx->users)))
ret = -EINVAL;

if (!*i)
--
1.8.1.3

2013-03-21 16:44:26

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 20/33] aio: give shared kioctx fields their own cachelines

[[email protected]: make reqs_active __cacheline_aligned_in_smp]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0e283ad..b71691d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -67,13 +67,6 @@ struct kioctx {
unsigned long user_id;
struct hlist_node list;

- wait_queue_head_t wait;
-
- spinlock_t ctx_lock;
-
- atomic_t reqs_active;
- struct list_head active_reqs; /* used for cancellation */
-
/*
* This is what userspace passed to io_setup(), it's not used for
* anything but counting against the global max_reqs quota.
@@ -92,19 +85,29 @@ struct kioctx {
struct page **ring_pages;
long nr_pages;

+ struct rcu_head rcu_head;
+ struct work_struct rcu_work;
+
+ struct {
+ atomic_t reqs_active;
+ } ____cacheline_aligned_in_smp;
+
+ struct {
+ spinlock_t ctx_lock;
+ struct list_head active_reqs; /* used for cancellation */
+ } ____cacheline_aligned_in_smp;
+
struct {
struct mutex ring_lock;
- } ____cacheline_aligned;
+ wait_queue_head_t wait;
+ } ____cacheline_aligned_in_smp;

struct {
unsigned tail;
spinlock_t completion_lock;
- } ____cacheline_aligned;
+ } ____cacheline_aligned_in_smp;

struct page *internal_pages[AIO_RING_PAGES];
-
- struct rcu_head rcu_head;
- struct work_struct rcu_work;
};

/*------ sysctl variables----*/
--
1.8.1.3

2013-03-21 16:44:47

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 33/33] aio: fix kioctx not being freed after cancellation at exit time

From: Benjamin LaHaise <[email protected]>

The recent changes overhauling fs/aio.c introduced a bug that results in the
kioctx not being freed when outstanding kiocbs are cancelled at exit_aio()
time. Specifically, a kiocb that is cancelled has its completion events
discarded by batch_complete_aio(), which then fails to wake up the process
stuck in free_ioctx(). Fix this by removing the event suppression in
batch_complete_aio() and modify the wait_event() condition in free_ioctx()
appropriately.
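
Concretely, the wait loop in free_ioctx() gains a second exit condition
so it can no longer hang when cancelled requests return their slots
without ever producing ring events (this is the first hunk below):

	wait_event(ctx->wait,
		   (head != ctx->shadow_tail) ||
		   (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));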

This patch was tested with the cancel operation in the thread based code
posted yesterday.

Signed-off-by: Benjamin LaHaise <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Josh Boyer <[email protected]>
Cc: Zach Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
fs/aio.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 4dbd240..d2c1a82 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -349,7 +349,9 @@ static void free_ioctx(struct kioctx *ctx)
kunmap_atomic(ring);

while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
- wait_event(ctx->wait, head != ctx->shadow_tail);
+ wait_event(ctx->wait,
+ (head != ctx->shadow_tail) ||
+ (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));

avail = (head <= ctx->shadow_tail
? ctx->shadow_tail : ctx->nr_events) - head;
@@ -774,17 +776,6 @@ void batch_complete_aio(struct batch_complete *batch)
n = rb_parent(n);
}

- if (unlikely(xchg(&req->ki_cancel,
- KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- /*
- * Can't use the percpu reqs_available here - could race
- * with free_ioctx()
- */
- atomic_inc(&req->ki_ctx->reqs_available);
- aio_put_req(req);
- continue;
- }
-
if (unlikely(req->ki_eventfd != eventfd)) {
if (eventfd) {
/* Make event visible */
--
1.8.1.3

2013-03-21 16:45:00

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 29/33] block: Prep work for batch completion

Add a struct batch_complete * argument to bi_end_io; infrastructure to
make use of it comes in the next patch.
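
Every bi_end_io implementation gains the extra argument, and submitters
that aren't batching simply pass NULL (as the dm-bufio hunks below do).
A hypothetical callback now has this shape:

	static void my_end_io(struct bio *bio, int error,
			      struct batch_complete *batch)
	{
		/* batch may be NULL when the caller isn't batching completions */
		if (error)
			clear_bit(BIO_UPTODATE, &bio->bi_flags);
		bio_put(bio);
	}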

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
block/blk-flush.c | 3 ++-
block/blk-lib.c | 3 ++-
drivers/block/drbd/drbd_bitmap.c | 2 +-
drivers/block/drbd/drbd_worker.c | 6 +++---
drivers/block/drbd/drbd_wrappers.h | 9 ++++++---
drivers/block/floppy.c | 3 ++-
drivers/block/pktcdvd.c | 9 ++++++---
drivers/block/xen-blkback/blkback.c | 3 ++-
drivers/md/dm-bufio.c | 9 +++++----
drivers/md/dm-crypt.c | 3 ++-
drivers/md/dm-io.c | 2 +-
drivers/md/dm-snap.c | 3 ++-
drivers/md/dm-thin.c | 3 ++-
drivers/md/dm-verity.c | 3 ++-
drivers/md/dm.c | 6 ++++--
drivers/md/faulty.c | 3 ++-
drivers/md/md.c | 9 ++++++---
drivers/md/multipath.c | 3 ++-
drivers/md/raid1.c | 15 ++++++++++-----
drivers/md/raid10.c | 21 ++++++++++++++-------
drivers/md/raid5.c | 15 ++++++++++-----
drivers/target/target_core_iblock.c | 6 ++++--
drivers/target/target_core_pscsi.c | 3 ++-
fs/bio-integrity.c | 3 ++-
fs/bio.c | 14 +++++++++-----
fs/btrfs/check-integrity.c | 14 +++++++++-----
fs/btrfs/compression.c | 6 ++++--
fs/btrfs/disk-io.c | 6 ++++--
fs/btrfs/extent_io.c | 12 ++++++++----
fs/btrfs/inode.c | 13 ++++++++-----
fs/btrfs/scrub.c | 18 ++++++++++++------
fs/btrfs/volumes.c | 4 ++--
fs/buffer.c | 3 ++-
fs/direct-io.c | 9 +++------
fs/ext4/page-io.c | 3 ++-
fs/f2fs/data.c | 2 +-
fs/f2fs/segment.c | 3 ++-
fs/gfs2/lops.c | 3 ++-
fs/gfs2/ops_fstype.c | 3 ++-
fs/hfsplus/wrapper.c | 3 ++-
fs/jfs/jfs_logmgr.c | 4 ++--
fs/jfs/jfs_metapage.c | 6 ++++--
fs/logfs/dev_bdev.c | 8 +++++---
fs/mpage.c | 2 +-
fs/nfs/blocklayout/blocklayout.c | 17 ++++++++++-------
fs/nilfs2/segbuf.c | 3 ++-
fs/ocfs2/cluster/heartbeat.c | 4 ++--
fs/xfs/xfs_aops.c | 3 ++-
fs/xfs/xfs_buf.c | 3 ++-
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 3 ++-
include/linux/fs.h | 2 +-
include/linux/swap.h | 3 ++-
mm/bounce.c | 12 ++++++++----
mm/page_io.c | 5 +++--
55 files changed, 213 insertions(+), 125 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index db8f1b5..d994710 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -384,7 +384,8 @@ void blk_abort_flushes(struct request_queue *q)
}
}

-static void bio_end_flush(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index d6f50d5..279f9de 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -15,7 +15,8 @@ struct bio_batch {
struct completion *wait;
};

-static void bio_batch_end_io(struct bio *bio, int err)
+static void bio_batch_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_batch *bb = bio->bi_private;

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 8dc2950..e366499 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -937,7 +937,7 @@ static void bm_aio_ctx_destroy(struct kref *kref)
}

/* bv_page may be a copy, or may be the original */
-static void bm_async_io_complete(struct bio *bio, int error)
+static void bm_async_io_complete(struct bio *bio, int error, struct batch_complete *batch)
{
struct bm_aio_ctx *ctx = bio->bi_private;
struct drbd_conf *mdev = ctx->mdev;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 424dc7b..34f7ab1 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -64,7 +64,7 @@ rwlock_t global_state_lock;
/* used for synchronous meta data and bitmap IO
* submitted by drbd_md_sync_page_io()
*/
-void drbd_md_io_complete(struct bio *bio, int error)
+void drbd_md_io_complete(struct bio *bio, int error, struct batch_complete *batch)
{
struct drbd_md_io *md_io;
struct drbd_conf *mdev;
@@ -166,7 +166,7 @@ static void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req) __rel
/* writes on behalf of the partner, or resync writes,
* "submitted" by the receiver.
*/
-void drbd_peer_request_endio(struct bio *bio, int error)
+void drbd_peer_request_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct drbd_peer_request *peer_req = bio->bi_private;
struct drbd_conf *mdev = peer_req->w.mdev;
@@ -202,7 +202,7 @@ void drbd_peer_request_endio(struct bio *bio, int error)

/* read, readA or write requests on R_PRIMARY coming from drbd_make_request
*/
-void drbd_request_endio(struct bio *bio, int error)
+void drbd_request_endio(struct bio *bio, int error, struct batch_complete *batch)
{
unsigned long flags;
struct drbd_request *req = bio->bi_private;
diff --git a/drivers/block/drbd/drbd_wrappers.h b/drivers/block/drbd/drbd_wrappers.h
index 328f18e..d443dc0 100644
--- a/drivers/block/drbd/drbd_wrappers.h
+++ b/drivers/block/drbd/drbd_wrappers.h
@@ -20,9 +20,12 @@ static inline void drbd_set_my_capacity(struct drbd_conf *mdev,
#define drbd_bio_uptodate(bio) bio_flagged(bio, BIO_UPTODATE)

/* bi_end_io handlers */
-extern void drbd_md_io_complete(struct bio *bio, int error);
-extern void drbd_peer_request_endio(struct bio *bio, int error);
-extern void drbd_request_endio(struct bio *bio, int error);
+extern void drbd_md_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_peer_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);

/*
* used to submit our private bio
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 2ddd64a..4ad77d8 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3748,7 +3748,8 @@ static unsigned int floppy_check_events(struct gendisk *disk,
* a disk in the drive, and whether that disk is writable.
*/

-static void floppy_rb0_complete(struct bio *bio, int err)
+static void floppy_rb0_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2e7de7a..50f5722 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1005,7 +1005,8 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
}
}

-static void pkt_end_io_read(struct bio *bio, int err)
+static void pkt_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -1023,7 +1024,8 @@ static void pkt_end_io_read(struct bio *bio, int err)
pkt_bio_finished(pd);
}

-static void pkt_end_io_packet_write(struct bio *bio, int err)
+static void pkt_end_io_packet_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -2395,7 +2397,8 @@ static int pkt_close(struct gendisk *disk, fmode_t mode)
}


-static void pkt_end_io_read_cloned(struct bio *bio, int err)
+static void pkt_end_io_read_cloned(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_stacked_data *psd = bio->bi_private;
struct pktcdvd_device *pd = psd->pd;
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index de1f319..f0c6fff 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -740,7 +740,8 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
/*
* bio callback.
*/
-static void end_block_io_op(struct bio *bio, int error)
+static void end_block_io_op(struct bio *bio, int error,
+ struct batch_complete *batch)
{
__end_block_io_op(bio->bi_private, error);
bio_put(bio);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 3c955e1..0b9ae79 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -472,7 +472,7 @@ static void dmio_complete(unsigned long error, void *context)
{
struct dm_buffer *b = context;

- b->bio.bi_end_io(&b->bio, error ? -EIO : 0);
+ b->bio.bi_end_io(&b->bio, error ? -EIO : 0, NULL);
}

static void use_dmio(struct dm_buffer *b, int rw, sector_t block,
@@ -503,7 +503,7 @@ static void use_dmio(struct dm_buffer *b, int rw, sector_t block,

r = dm_io(&io_req, 1, &region, NULL);
if (r)
- end_io(&b->bio, r);
+ end_io(&b->bio, r, NULL);
}

static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
@@ -570,7 +570,8 @@ static void submit_io(struct dm_buffer *b, int rw, sector_t block,
* Set the error, clear B_WRITING bit and wake anyone who was waiting on
* it.
*/
-static void write_endio(struct bio *bio, int error)
+static void write_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

@@ -943,7 +944,7 @@ found_buffer:
* The endio routine for reading: set the error, clear the bit and wake up
* anyone waiting on the buffer.
*/
-static void read_endio(struct bio *bio, int error)
+static void read_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 13c1548..cffeff1 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -930,7 +930,8 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
* The work is done per CPU global for all dm-crypt instances.
* They should not depend on each other and do not block.
*/
-static void crypt_endio(struct bio *clone, int error)
+static void crypt_endio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_crypt_io *io = clone->bi_private;
struct crypt_config *cc = io->cc;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea49834..a727b26 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -136,7 +136,7 @@ static void dec_count(struct io *io, unsigned int region, int error)
}
}

-static void endio(struct bio *bio, int error)
+static void endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct io *io;
unsigned region;
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c0e0702..eb32e35 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1485,7 +1485,8 @@ static void start_copy(struct dm_snap_pending_exception *pe)
dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
}

-static void full_bio_end_io(struct bio *bio, int error)
+static void full_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
void *callback_data = bio->bi_private;

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 009339d..3ae5614 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -553,7 +553,8 @@ static void copy_complete(int read_err, unsigned long write_err, void *context)
spin_unlock_irqrestore(&pool->lock, flags);
}

-static void overwrite_endio(struct bio *bio, int err)
+static void overwrite_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
unsigned long flags;
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 6ad5383..ba49b8d 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -406,7 +406,8 @@ static void verity_work(struct work_struct *w)
verity_finish_io(io, verity_verify_io(io));
}

-static void verity_end_io(struct bio *bio, int error)
+static void verity_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_verity_io *io = bio->bi_private;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7e46926..a1e371a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -616,7 +616,8 @@ static void dec_pending(struct dm_io *io, int error)
}
}

-static void clone_endio(struct bio *bio, int error)
+static void clone_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int r = 0;
struct dm_target_io *tio = bio->bi_private;
@@ -651,7 +652,8 @@ static void clone_endio(struct bio *bio, int error)
/*
* Partial completion handling for request-based dm
*/
-static void end_clone_bio(struct bio *clone, int error)
+static void end_clone_bio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_rq_clone_bio_info *info = clone->bi_private;
struct dm_rq_target_io *tio = info->tio;
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 5e7dc77..7ef4442 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -70,7 +70,8 @@
#include <linux/seq_file.h>


-static void faulty_fail(struct bio *bio, int error)
+static void faulty_fail(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio *b = bio->bi_private;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fcb878f..8639a07 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -385,7 +385,8 @@ EXPORT_SYMBOL(mddev_congested);
* Generic flush handling for md
*/

-static void md_end_flush(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -762,7 +763,8 @@ void md_rdev_clear(struct md_rdev *rdev)
}
EXPORT_SYMBOL_GPL(md_rdev_clear);

-static void super_written(struct bio *bio, int error)
+static void super_written(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -813,7 +815,8 @@ void md_super_wait(struct mddev *mddev)
finish_wait(&mddev->sb_wait, &wq);
}

-static void bi_complete(struct bio *bio, int error)
+static void bi_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
complete((struct completion*)bio->bi_private);
}
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 1642eae..fecad70 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -83,7 +83,8 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
mempool_free(mp_bh, conf->pool);
}

-static void multipath_end_request(struct bio *bio, int error)
+static void multipath_end_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct multipath_bh *mp_bh = bio->bi_private;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fd86b37..6c7b720 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -304,7 +304,8 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
return mirror;
}

-static void raid1_end_read_request(struct bio *bio, int error)
+static void raid1_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -389,7 +390,8 @@ static void r1_bio_write_done(struct r1bio *r1_bio)
}
}

-static void raid1_end_write_request(struct bio *bio, int error)
+static void raid1_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -1621,7 +1623,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r1bio *r1_bio = bio->bi_private;

@@ -1639,7 +1642,8 @@ static void end_sync_read(struct bio *bio, int error)
reschedule_retry(r1_bio);
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -2059,7 +2063,8 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
}
}

-static void bi_complete(struct bio *bio, int error)
+static void bi_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 77b562d..331f872 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -101,7 +101,8 @@ static int enough(struct r10conf *conf, int ignore);
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
int *skipped);
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
-static void end_reshape_write(struct bio *bio, int error);
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch);
static void end_reshape(struct r10conf *conf);

static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
@@ -358,7 +359,8 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
return r10_bio->devs[slot].devnum;
}

-static void raid10_end_read_request(struct bio *bio, int error)
+static void raid10_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -441,7 +443,8 @@ static void one_write_done(struct r10bio *r10_bio)
}
}

-static void raid10_end_write_request(struct bio *bio, int error)
+static void raid10_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -1909,7 +1912,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r10bio *r10_bio = bio->bi_private;
struct r10conf *conf = r10_bio->mddev->private;
@@ -1970,7 +1974,8 @@ static void end_sync_request(struct r10bio *r10_bio)
}
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -2531,7 +2536,8 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
}
}

-static void bi_complete(struct bio *bio, int error)
+static void bi_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -4612,7 +4618,8 @@ static int handle_reshape_read_error(struct mddev *mddev,
return 0;
}

-static void end_reshape_write(struct bio *bio, int error)
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ee2912..5c54f98 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -530,9 +530,11 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
}

static void
-raid5_end_read_request(struct bio *bi, int error);
+raid5_end_read_request(struct bio *bi, int error,
+ struct batch_complete *batch);
static void
-raid5_end_write_request(struct bio *bi, int error);
+raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch);

static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
{
@@ -1709,7 +1711,8 @@ static void shrink_stripes(struct r5conf *conf)
conf->slab_cache = NULL;
}

-static void raid5_end_read_request(struct bio * bi, int error)
+static void raid5_end_read_request(struct bio * bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -1829,7 +1832,8 @@ static void raid5_end_read_request(struct bio * bi, int error)
release_stripe(sh);
}

-static void raid5_end_write_request(struct bio *bi, int error)
+static void raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -3860,7 +3864,8 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
* first).
* If the read failed..
*/
-static void raid5_align_endio(struct bio *bi, int error)
+static void raid5_align_endio(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct bio* raid_bi = bi->bi_private;
struct mddev *mddev;
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 8bcc514..c2e5ca9 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -271,7 +271,8 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
kfree(ibr);
}

-static void iblock_bio_done(struct bio *bio, int err)
+static void iblock_bio_done(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;
struct iblock_req *ibr = cmd->priv;
@@ -335,7 +336,8 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
blk_finish_plug(&plug);
}

-static void iblock_end_io_flush(struct bio *bio, int err)
+static void iblock_end_io_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;

diff --git a/drivers/target/target_core_pscsi.c b/drivers/target/target_core_pscsi.c
index 82e78d7..c234bca 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -835,7 +835,8 @@ static ssize_t pscsi_show_configfs_dev_params(struct se_device *dev, char *b)
return bl;
}

-static void pscsi_bi_endio(struct bio *bio, int error)
+static void pscsi_bi_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
bio_put(bio);
}
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index a3f28f3..293754c 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -544,7 +544,8 @@ static void bio_integrity_verify_fn(struct work_struct *work)
* in process context. This function postpones completion
* accordingly.
*/
-void bio_integrity_endio(struct bio *bio, int error)
+void bio_integrity_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio_integrity_payload *bip = bio->bi_integrity;

diff --git a/fs/bio.c b/fs/bio.c
index bb5768f..b2f9c0d 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1136,7 +1136,8 @@ void bio_unmap_user(struct bio *bio)
}
EXPORT_SYMBOL(bio_unmap_user);

-static void bio_map_kern_endio(struct bio *bio, int err)
+static void bio_map_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bio_put(bio);
}
@@ -1208,7 +1209,8 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
}
EXPORT_SYMBOL(bio_map_kern);

-static void bio_copy_kern_endio(struct bio *bio, int err)
+static void bio_copy_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec;
const int read = bio_data_dir(bio) == READ;
@@ -1431,7 +1433,7 @@ void bio_endio(struct bio *bio, int error)
trace_block_bio_complete(bio, error);

if (bio->bi_end_io)
- bio->bi_end_io(bio, error);
+ bio->bi_end_io(bio, error, NULL);
}
EXPORT_SYMBOL(bio_endio);

@@ -1446,7 +1448,8 @@ void bio_pair_release(struct bio_pair *bp)
}
EXPORT_SYMBOL(bio_pair_release);

-static void bio_pair_end_1(struct bio *bi, int err)
+static void bio_pair_end_1(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);

@@ -1456,7 +1459,8 @@ static void bio_pair_end_1(struct bio *bi, int err)
bio_pair_release(bp);
}

-static void bio_pair_end_2(struct bio *bi, int err)
+static void bio_pair_end_2(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 18af6f4..3c617b3 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -323,7 +323,8 @@ static void btrfsic_release_block_ctx(struct btrfsic_block_data_ctx *block_ctx);
static int btrfsic_read_block(struct btrfsic_state *state,
struct btrfsic_block_data_ctx *block_ctx);
static void btrfsic_dump_database(struct btrfsic_state *state);
-static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int btrfsic_test_for_metadata(struct btrfsic_state *state,
char **datav, unsigned int num_pages);
static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
@@ -336,7 +337,8 @@ static int btrfsic_process_written_superblock(
struct btrfsic_state *state,
struct btrfsic_block *const block,
struct btrfs_super_block *const super_hdr);
-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status);
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch);
static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate);
static int btrfsic_is_block_ref_by_superblock(const struct btrfsic_state *state,
const struct btrfsic_block *block,
@@ -1751,7 +1753,8 @@ static int btrfsic_read_block(struct btrfsic_state *state,
return block_ctx->len;
}

-static void btrfsic_complete_bio_end_io(struct bio *bio, int err)
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -2294,7 +2297,8 @@ continue_loop:
goto again;
}

-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch)
{
struct btrfsic_block *block = (struct btrfsic_block *)bp->bi_private;
int iodone_w_error;
@@ -2342,7 +2346,7 @@ static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
block = next_block;
} while (NULL != block);

- bp->bi_end_io(bp, bio_error_status);
+ bp->bi_end_io(bp, bio_error_status, batch);
}

static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15b9408..74ae115 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -153,7 +153,8 @@ fail:
* The compressed pages are freed here, and it must be run
* in process context
*/
-static void end_compressed_bio_read(struct bio *bio, int err)
+static void end_compressed_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct compressed_bio *cb = bio->bi_private;
struct inode *inode;
@@ -263,7 +264,8 @@ static noinline void end_compressed_writeback(struct inode *inode, u64 start,
* This also calls the writeback end hooks for the file pages so that
* metadata and checksums can be updated in the file.
*/
-static void end_compressed_bio_write(struct bio *bio, int err)
+static void end_compressed_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct extent_io_tree *tree;
struct compressed_bio *cb = bio->bi_private;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d84651..46eaa85 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -669,7 +669,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
return -EIO; /* we fixed nothing */
}

-static void end_workqueue_bio(struct bio *bio, int err)
+static void end_workqueue_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct end_io_wq *end_io_wq = bio->bi_private;
struct btrfs_fs_info *fs_info;
@@ -2951,7 +2952,8 @@ static int write_dev_supers(struct btrfs_device *device,
* endio for the write_dev_flush, this will wake anyone waiting
* for the barrier when it is done
*/
-static void btrfs_end_empty_barrier(struct bio *bio, int err)
+static void btrfs_end_empty_barrier(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err) {
if (err == -EOPNOTSUPP)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f173c5a..5807813 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1904,7 +1904,8 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
return err;
}

-static void repair_io_failure_callback(struct bio *bio, int err)
+static void repair_io_failure_callback(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete(bio->bi_private);
}
@@ -2284,7 +2285,8 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_writepage(struct bio *bio, int err)
+static void end_bio_extent_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct extent_io_tree *tree;
@@ -2330,7 +2332,8 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_readpage(struct bio *bio, int err)
+static void end_bio_extent_readpage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -3154,7 +3157,8 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
}

-static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
+static void end_bio_extent_buffer_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = err == 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca26188..436d022 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6900,7 +6900,8 @@ struct btrfs_dio_private {
struct bio *orig_bio;
};

-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static void btrfs_endio_direct_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -6954,10 +6955,11 @@ failed:
/* If we had a csum failure make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

-static void btrfs_endio_direct_write(struct bio *bio, int err)
+static void btrfs_endio_direct_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct inode *inode = dip->inode;
@@ -6999,7 +7001,7 @@ out_done:
/* If we had an error make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
@@ -7013,7 +7015,8 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
return 0;
}

-static void btrfs_end_dio_bio(struct bio *bio, int err)
+static void btrfs_end_dio_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 53c3501..239d397 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -200,7 +200,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
int is_metadata, int have_csum,
const u8 *csum, u64 generation,
u16 csum_size);
-static void scrub_complete_bio_end_io(struct bio *bio, int err);
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int scrub_repair_block_from_good_copy(struct scrub_block *sblock_bad,
struct scrub_block *sblock_good,
int force_write);
@@ -223,7 +224,8 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
u64 physical, struct btrfs_device *dev, u64 flags,
u64 gen, int mirror_num, u8 *csum, int force,
u64 physical_for_dev_replace);
-static void scrub_bio_end_io(struct bio *bio, int err);
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_bio_end_io_worker(struct btrfs_work *work);
static void scrub_block_complete(struct scrub_block *sblock);
static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
@@ -240,7 +242,8 @@ static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx);
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage);
static void scrub_wr_submit(struct scrub_ctx *sctx);
-static void scrub_wr_bio_end_io(struct bio *bio, int err);
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_wr_bio_end_io_worker(struct btrfs_work *work);
static int write_page_nocow(struct scrub_ctx *sctx,
u64 physical_for_dev_replace, struct page *page);
@@ -1385,7 +1388,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
sblock->checksum_error = 1;
}

-static void scrub_complete_bio_end_io(struct bio *bio, int err)
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -1585,7 +1589,8 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
btrfsic_submit_bio(WRITE, sbio->bio);
}

-static void scrub_wr_bio_end_io(struct bio *bio, int err)
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
@@ -2055,7 +2060,8 @@ leave_nomem:
return 0;
}

-static void scrub_bio_end_io(struct bio *bio, int err)
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5989a92..6ee155e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5018,7 +5018,7 @@ static unsigned int extract_stripe_index_from_bio_private(void *bi_private)
return (unsigned int)((uintptr_t)bi_private) & 3;
}

-static void btrfs_end_bio(struct bio *bio, int err)
+static void btrfs_end_bio(struct bio *bio, int err, struct batch_complete *batch)
{
struct btrfs_bio *bbio = extract_bbio_from_bio_private(bio->bi_private);
int is_orig_bio = 0;
@@ -5075,7 +5075,7 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
kfree(bbio);

- bio_endio(bio, err);
+ bio_endio_batch(bio, err, batch);
} else if (!is_orig_bio) {
bio_put(bio);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..a1eae65 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2884,7 +2884,8 @@ sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
}
EXPORT_SYMBOL(generic_block_bmap);

-static void end_bio_bh_io_sync(struct bio *bio, int err)
+static void end_bio_bh_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct buffer_head *bh = bio->bi_private;

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4348b01..6ab9b88 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -324,12 +324,12 @@ static void dio_bio_end_io(struct bio *bio, int error)
* so that the DIO specific endio actions are dealt with after the filesystem
* has done it's completion work.
*/
-void dio_end_io(struct bio *bio, int error)
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;

if (dio->is_async)
- dio_bio_end_aio(bio, error);
+ dio_bio_end_aio(bio, error, batch);
else
dio_bio_end_io(bio, error);
}
@@ -350,10 +350,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,

bio->bi_bdev = bdev;
bio->bi_sector = first_sector;
- if (dio->is_async)
- bio->bi_end_io = dio_bio_end_aio;
- else
- bio->bi_end_io = dio_bio_end_io;
+ bio->bi_end_io = dio_end_io;

sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index d9903af..305ecfa 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -229,7 +229,8 @@ static void buffer_io_error(struct buffer_head *bh)
(unsigned long long)bh->b_blocknr);
}

-static void ext4_end_bio(struct bio *bio, int error)
+static void ext4_end_bio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
ext4_io_end_t *io_end = bio->bi_private;
struct inode *inode;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index d0ed4ba..f4711e8 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -301,7 +301,7 @@ struct page *get_new_data_page(struct inode *inode, pgoff_t index,
return page;
}

-static void read_end_io(struct bio *bio, int err)
+static void read_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 777f17e..9a3ae8d 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -601,7 +601,8 @@ static const struct segment_allocation default_salloc_ops = {
.allocate_segment = allocate_segment_by_default,
};

-static void f2fs_end_io_write(struct bio *bio, int err)
+static void f2fs_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index a505597..942a968 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -200,7 +200,8 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
*
*/

-static void gfs2_end_log_write(struct bio *bio, int error)
+static void gfs2_end_log_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct gfs2_sbd *sdp = bio->bi_private;
struct bio_vec *bvec;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 60ede2a..86eb657 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -155,7 +155,8 @@ static int gfs2_check_sb(struct gfs2_sbd *sdp, int silent)
return -EINVAL;
}

-static void end_bio_io_page(struct bio *bio, int error)
+static void end_bio_io_page(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/hfsplus/wrapper.c b/fs/hfsplus/wrapper.c
index 90effcc..2e7ffba 100644
--- a/fs/hfsplus/wrapper.c
+++ b/fs/hfsplus/wrapper.c
@@ -24,7 +24,8 @@ struct hfsplus_wd {
u16 embed_count;
};

-static void hfsplus_end_io_sync(struct bio *bio, int err)
+static void hfsplus_end_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index 2eb952c..1f3ee7a 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2154,7 +2154,7 @@ static void lbmStartIO(struct lbuf * bp)
/* check if journaling to disk has been disabled */
if (log->no_integrity) {
bio->bi_size = 0;
- lbmIODone(bio, 0);
+ lbmIODone(bio, 0, NULL);
} else {
submit_bio(WRITE_SYNC, bio);
INCREMENT(lmStat.submitted);
@@ -2192,7 +2192,7 @@ static int lbmIOWait(struct lbuf * bp, int flag)
*
* executed at INTIODONE level
*/
-static void lbmIODone(struct bio *bio, int error)
+static void lbmIODone(struct bio *bio, int error, struct batch_complete *batch)
{
struct lbuf *bp = bio->bi_private;
struct lbuf *nextbp, *tail;
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 6740d34..6ba6757 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -283,7 +283,8 @@ static void last_read_complete(struct page *page)
unlock_page(page);
}

-static void metapage_read_end_io(struct bio *bio, int err)
+static void metapage_read_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

@@ -338,7 +339,8 @@ static void last_write_complete(struct page *page)
end_page_writeback(page);
}

-static void metapage_write_end_io(struct bio *bio, int err)
+static void metapage_write_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index e784a21..281a968 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -14,7 +14,8 @@

#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))

-static void request_complete(struct bio *bio, int err)
+static void request_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -65,7 +66,8 @@ static int bdev_readpage(void *_sb, struct page *page)

static DECLARE_WAIT_QUEUE_HEAD(wq);

-static void writeseg_end_io(struct bio *bio, int err)
+static void writeseg_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -171,7 +173,7 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
}


-static void erase_end_io(struct bio *bio, int err)
+static void erase_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct super_block *sb = bio->bi_private;
diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..a4089bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -41,7 +41,7 @@
* status of that page is hard. See end_buffer_async_read() for the details.
* There is no point in duplicating all that complexity.
*/
-static void mpage_end_io(struct bio *bio, int err)
+static void mpage_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 434b93e..76cf695 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -143,7 +143,7 @@ bl_submit_bio(int rw, struct bio *bio)

static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
struct bio *bio;
@@ -167,7 +167,7 @@ static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
static struct bio *do_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par,
unsigned int offset, int len)
{
@@ -190,7 +190,7 @@ retry:
static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
return do_add_page_to_bio(bio, npg, rw, isect, page, be,
@@ -198,7 +198,8 @@ static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
}

/* This is basically copied from mpage_end_io_read */
-static void bl_end_io_read(struct bio *bio, int err)
+static void bl_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -380,7 +381,8 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
}
}

-static void bl_end_io_write_zero(struct bio *bio, int err)
+static void bl_end_io_write_zero(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -408,7 +410,8 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
put_parallel(par);
}

-static void bl_end_io_write(struct bio *bio, int err)
+static void bl_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -487,7 +490,7 @@ map_block(struct buffer_head *bh, sector_t isect, struct pnfs_block_extent *be)
}

static void
-bl_read_single_end_io(struct bio *bio, int error)
+bl_read_single_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct page *page = bvec->bv_page;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc9a913..680b65b 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -338,7 +338,8 @@ void nilfs_add_checksums_on_logs(struct list_head *logs, u32 seed)
/*
* BIO operations
*/
-static void nilfs_end_bio_write(struct bio *bio, int err)
+static void nilfs_end_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct nilfs_segment_buffer *segbuf = bio->bi_private;
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 42252bf..73ed9d6 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -380,8 +380,8 @@ static void o2hb_wait_on_io(struct o2hb_region *reg,
wait_for_completion(&wc->wc_io_complete);
}

-static void o2hb_bio_end_io(struct bio *bio,
- int error)
+static void o2hb_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct o2hb_bio_wait_ctxt *wc = bio->bi_private;

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c24ce0e..32e5be8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -380,7 +380,8 @@ xfs_imap_valid(
STATIC void
xfs_end_bio(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_ioend_t *ioend = bio->bi_private;

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4e8f0df..cb79d41 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1222,7 +1222,8 @@ _xfs_buf_ioend(
STATIC void
xfs_buf_bio_end_io(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_buf_t *bp = (xfs_buf_t *)bio->bi_private;

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 820e7aa..1d077bd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -553,7 +553,7 @@ extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_prep(struct bio *);
-extern void bio_integrity_endio(struct bio *, int);
+extern void bio_integrity_endio(struct bio *, int, struct batch_complete *);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cdf1119..a3f578b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -16,7 +16,8 @@ struct page;
struct block_device;
struct io_context;
struct cgroup_subsys_state;
-typedef void (bio_end_io_t) (struct bio *, int);
+struct batch_complete;
+typedef void (bio_end_io_t) (struct bio *, int, struct batch_complete *);
typedef void (bio_destructor_t) (struct bio *);

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..9032438 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2435,7 +2435,7 @@ enum {
DIO_SKIP_HOLES = 0x02,
};

-void dio_end_io(struct bio *bio, int error);
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch);

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..7429973 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -331,7 +331,8 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
extern int swap_readpage(struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
extern int swap_set_page_dirty(struct page *page);
-extern void end_swap_bio_read(struct bio *bio, int err);
+extern void end_swap_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch);

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
diff --git a/mm/bounce.c b/mm/bounce.c
index 5f89017..38d8a3a 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -147,12 +147,14 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
bio_put(bio);
}

-static void bounce_end_io_write(struct bio *bio, int err)
+static void bounce_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bounce_end_io(bio, page_pool, err);
}

-static void bounce_end_io_write_isa(struct bio *bio, int err)
+static void bounce_end_io_write_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{

bounce_end_io(bio, isa_page_pool, err);
@@ -168,12 +170,14 @@ static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err)
bounce_end_io(bio, pool, err);
}

-static void bounce_end_io_read(struct bio *bio, int err)
+static void bounce_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, page_pool, err);
}

-static void bounce_end_io_read_isa(struct bio *bio, int err)
+static void bounce_end_io_read_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, isa_page_pool, err);
}
diff --git a/mm/page_io.c b/mm/page_io.c
index c535d39..8800095 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -43,7 +43,8 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
return bio;
}

-static void end_swap_bio_write(struct bio *bio, int err)
+static void end_swap_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
@@ -69,7 +70,7 @@ static void end_swap_bio_write(struct bio *bio, int err)
bio_put(bio);
}

-void end_swap_bio_read(struct bio *bio, int err)
+void end_swap_bio_read(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
--
1.8.1.3

2013-03-21 16:36:18

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 06/33] aio: kill return value of aio_complete()

Nothing used the return value, and it probably wasn't possible to use it
safely for the locked versions (aio_complete(), aio_put_req()). Just kill
it.

Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 21 +++++++--------------
include/linux/aio.h | 8 ++++----
2 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f9a7e6a..6b29e41a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -531,7 +531,7 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
/* __aio_put_req
* Returns true if this put was the last user of the request.
*/
-static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
+static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));
@@ -541,7 +541,7 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
- return 0;
+ return;
list_del(&req->ki_list); /* remove from active_reqs */
req->ki_cancel = NULL;
req->ki_retry = NULL;
@@ -549,21 +549,18 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
fput(req->ki_filp);
req->ki_filp = NULL;
really_put_req(ctx, req);
- return 1;
}

/* aio_put_req
* Returns true if this put was the last user of the kiocb,
* false if the request is still in use.
*/
-int aio_put_req(struct kiocb *req)
+void aio_put_req(struct kiocb *req)
{
struct kioctx *ctx = req->ki_ctx;
- int ret;
spin_lock_irq(&ctx->ctx_lock);
- ret = __aio_put_req(ctx, req);
+ __aio_put_req(ctx, req);
spin_unlock_irq(&ctx->ctx_lock);
- return ret;
}
EXPORT_SYMBOL(aio_put_req);

@@ -593,10 +590,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

/* aio_complete
* Called when the io request on the given iocb is complete.
- * Returns true if this is the last user of the request. The
- * only other user of the request can be the cancellation code.
*/
-int aio_complete(struct kiocb *iocb, long res, long res2)
+void aio_complete(struct kiocb *iocb, long res, long res2)
{
struct kioctx *ctx = iocb->ki_ctx;
struct aio_ring_info *info;
@@ -604,7 +599,6 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
struct io_event *event;
unsigned long flags;
unsigned long tail;
- int ret;

/*
* Special case handling for sync iocbs:
@@ -618,7 +612,7 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
iocb->ki_user_data = res;
iocb->ki_users = 0;
wake_up_process(iocb->ki_obj.tsk);
- return 1;
+ return;
}

info = &ctx->ring_info;
@@ -677,7 +671,7 @@ int aio_complete(struct kiocb *iocb, long res, long res2)

put_rq:
/* everything turned out well, dispose of the aiocb. */
- ret = __aio_put_req(ctx, iocb);
+ __aio_put_req(ctx, iocb);

/*
* We have to order our ring_info tail store above and test
@@ -691,7 +685,6 @@ put_rq:
wake_up(&ctx->wait);

spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- return ret;
}
EXPORT_SYMBOL(aio_complete);

diff --git a/include/linux/aio.h b/include/linux/aio.h
index 019204e..615d55a 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -167,16 +167,16 @@ struct kioctx {
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern int aio_put_req(struct kiocb *iocb);
-extern int aio_complete(struct kiocb *iocb, long res, long res2);
+extern void aio_put_req(struct kiocb *iocb);
+extern void aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user *__user *iocbpp, bool compat);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-static inline int aio_complete(struct kiocb *iocb, long res, long res2) { return 0; }
+static inline void aio_put_req(struct kiocb *iocb) { }
+static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
static inline long do_io_submit(aio_context_t ctx_id, long nr,
--
1.8.1.3

2013-03-21 16:45:43

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 11/33] aio: make aio_put_req() lockless

Freeing a kiocb needed to touch the kioctx for three things:

* Pulling it off the reqs_active list
* Decrementing reqs_active
* Issuing a wakeup if the kioctx was in the process of being freed.

This patch moves these to aio_complete(), for a few reasons:

* aio_complete() already has to issue the wakeup, so if we drop the
kioctx refcount before aio_complete does its wakeup we don't have to
do it twice.
* aio_complete currently has to take the kioctx lock, so it makes sense
for it to pull the kiocb off the reqs_active list too.
* A later patch is going to change reqs_active to include unreaped
completions - this will mean allocating a kiocb doesn't have to look
at the ringbuffer. So taking the decrement of reqs_active out of
kiocb_free() is useful prep work for that patch.

This doesn't really affect cancellation, since existing (usb) code that
implements a cancel function still calls aio_complete() - we just have
to make sure that aio_complete does the necessary teardown for cancelled
kiocbs.

It does affect code paths where we free kiocbs that were never
submitted; they need to decrement reqs_active and pull the kiocb off the
reqs_active list. This occurs in two places: kiocb_batch_free(), which
is going away in a later patch, and the error path in io_submit_one.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 85 +++++++++++++++++++++--------------------------------
include/linux/aio.h | 4 +--
2 files changed, 35 insertions(+), 54 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 4f23d43..3524bb2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -89,7 +89,7 @@ struct kioctx {

spinlock_t ctx_lock;

- int reqs_active;
+ atomic_t reqs_active;
struct list_head active_reqs; /* used for cancellation */

/* sys_io_setup currently limits this to an unsigned int */
@@ -250,7 +250,7 @@ static void ctx_rcu_free(struct rcu_head *head)
static void __put_ioctx(struct kioctx *ctx)
{
unsigned nr_events = ctx->max_reqs;
- BUG_ON(ctx->reqs_active);
+ BUG_ON(atomic_read(&ctx->reqs_active));

aio_free_ring(ctx);
if (nr_events) {
@@ -284,7 +284,7 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
cancel = kiocb->ki_cancel;
kiocbSetCancelled(kiocb);
if (cancel) {
- kiocb->ki_users++;
+ atomic_inc(&kiocb->ki_users);
spin_unlock_irq(&ctx->ctx_lock);

memset(res, 0, sizeof(*res));
@@ -383,12 +383,12 @@ static void kill_ctx(struct kioctx *ctx)
kiocb_cancel(ctx, req, &res);
}

- if (!ctx->reqs_active)
+ if (!atomic_read(&ctx->reqs_active))
goto out;

add_wait_queue(&ctx->wait, &wait);
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- while (ctx->reqs_active) {
+ while (atomic_read(&ctx->reqs_active)) {
spin_unlock_irq(&ctx->ctx_lock);
io_schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -406,9 +406,9 @@ out:
*/
ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
{
- while (iocb->ki_users) {
+ while (atomic_read(&iocb->ki_users)) {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (!iocb->ki_users)
+ if (!atomic_read(&iocb->ki_users))
break;
io_schedule();
}
@@ -438,7 +438,7 @@ void exit_aio(struct mm_struct *mm)
printk(KERN_DEBUG
"exit_aio:ioctx still alive: %d %d %d\n",
atomic_read(&ctx->users), ctx->dead,
- ctx->reqs_active);
+ atomic_read(&ctx->reqs_active));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -453,11 +453,11 @@ void exit_aio(struct mm_struct *mm)
}

/* aio_get_req
- * Allocate a slot for an aio request. Increments the users count
+ * Allocate a slot for an aio request. Increments the ki_users count
* of the kioctx so that the kioctx stays around until all requests are
* complete. Returns NULL if no requests are free.
*
- * Returns with kiocb->users set to 2. The io submit code path holds
+ * Returns with kiocb->ki_users set to 2. The io submit code path holds
* an extra reference while submitting the i/o.
* This prevents races between the aio code path referencing the
* req (after submitting it) and aio_complete() freeing the req.
@@ -471,7 +471,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
return NULL;

req->ki_flags = 0;
- req->ki_users = 2;
+ atomic_set(&req->ki_users, 2);
req->ki_key = 0;
req->ki_ctx = ctx;
req->ki_cancel = NULL;
@@ -512,9 +512,9 @@ static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
list_del(&req->ki_batch);
list_del(&req->ki_list);
kmem_cache_free(kiocb_cachep, req);
- ctx->reqs_active--;
+ atomic_dec(&ctx->reqs_active);
}
- if (unlikely(!ctx->reqs_active && ctx->dead))
+ if (unlikely(!atomic_read(&ctx->reqs_active) && ctx->dead))
wake_up_all(&ctx->wait);
spin_unlock_irq(&ctx->ctx_lock);
}
@@ -545,7 +545,7 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
spin_lock_irq(&ctx->ctx_lock);
ring = kmap_atomic(ctx->ring_info.ring_pages[0]);

- avail = aio_ring_avail(&ctx->ring_info, ring) - ctx->reqs_active;
+ avail = aio_ring_avail(&ctx->ring_info, ring) - atomic_read(&ctx->reqs_active);
BUG_ON(avail < 0);
if (avail < allocated) {
/* Trim back the number of requests. */
@@ -560,7 +560,7 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
batch->count -= allocated;
list_for_each_entry(req, &batch->head, ki_batch) {
list_add(&req->ki_list, &ctx->active_reqs);
- ctx->reqs_active++;
+ atomic_inc(&ctx->reqs_active);
}

kunmap_atomic(ring);
@@ -583,10 +583,8 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx,
return req;
}

-static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
+static void kiocb_free(struct kiocb *req)
{
- assert_spin_locked(&ctx->ctx_lock);
-
if (req->ki_filp)
fput(req->ki_filp);
if (req->ki_eventfd != NULL)
@@ -596,40 +594,12 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
if (req->ki_iovec != &req->ki_inline_vec)
kfree(req->ki_iovec);
kmem_cache_free(kiocb_cachep, req);
- ctx->reqs_active--;
-
- if (unlikely(!ctx->reqs_active && ctx->dead))
- wake_up_all(&ctx->wait);
}

-/* __aio_put_req
- * Returns true if this put was the last user of the request.
- */
-static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
-{
- assert_spin_locked(&ctx->ctx_lock);
-
- req->ki_users--;
- BUG_ON(req->ki_users < 0);
- if (likely(req->ki_users))
- return;
- list_del(&req->ki_list); /* remove from active_reqs */
- req->ki_cancel = NULL;
- req->ki_retry = NULL;
-
- really_put_req(ctx, req);
-}
-
-/* aio_put_req
- * Returns true if this put was the last user of the kiocb,
- * false if the request is still in use.
- */
void aio_put_req(struct kiocb *req)
{
- struct kioctx *ctx = req->ki_ctx;
- spin_lock_irq(&ctx->ctx_lock);
- __aio_put_req(ctx, req);
- spin_unlock_irq(&ctx->ctx_lock);
+ if (atomic_dec_and_test(&req->ki_users))
+ kiocb_free(req);
}
EXPORT_SYMBOL(aio_put_req);

@@ -677,9 +647,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
* - the sync task helpfully left a reference to itself in the iocb
*/
if (is_sync_kiocb(iocb)) {
- BUG_ON(iocb->ki_users != 1);
+ BUG_ON(atomic_read(&iocb->ki_users) != 1);
iocb->ki_user_data = res;
- iocb->ki_users = 0;
+ atomic_set(&iocb->ki_users, 0);
wake_up_process(iocb->ki_obj.tsk);
return;
}
@@ -694,6 +664,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
spin_lock_irqsave(&ctx->ctx_lock, flags);

+ list_del(&iocb->ki_list); /* remove from active_reqs */
+
/*
* cancelled requests don't get events, userland was given one
* when the event got cancelled.
@@ -740,7 +712,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

put_rq:
/* everything turned out well, dispose of the aiocb. */
- __aio_put_req(ctx, iocb);
+ aio_put_req(iocb);
+ atomic_dec(&ctx->reqs_active);

/*
* We have to order our ring_info tail store above and test
@@ -905,7 +878,7 @@ static int read_events(struct kioctx *ctx,
break;
/* Try to only show up in io wait if there are ops
* in flight */
- if (ctx->reqs_active)
+ if (atomic_read(&ctx->reqs_active))
io_schedule();
else
schedule();
@@ -1364,6 +1337,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;

out_put_req:
+ spin_lock_irq(&ctx->ctx_lock);
+ list_del(&req->ki_list);
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ atomic_dec(&ctx->reqs_active);
+ if (unlikely(!atomic_read(&ctx->reqs_active) && ctx->dead))
+ wake_up_all(&ctx->wait);
+
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7b1eb23..1e728f0 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -49,7 +49,7 @@ struct kioctx;
*/
struct kiocb {
unsigned long ki_flags;
- int ki_users;
+ atomic_t ki_users;
unsigned ki_key; /* id of this request */

struct file *ki_filp;
@@ -96,7 +96,7 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
- .ki_users = 1,
+ .ki_users = ATOMIC_INIT(1),
.ki_key = KIOCB_SYNC_KEY,
.ki_filp = filp,
.ki_obj.tsk = current,
--
1.8.1.3

2013-03-21 16:46:01

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 09/33] aio: dprintk() -> pr_debug()

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 57 ++++++++++++++++++++++++---------------------------------
1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b3b61d1..2637555 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -8,6 +8,8 @@
*
* See ../COPYING for licensing terms.
*/
+#define pr_fmt(fmt) "%s: " fmt, __func__
+
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/errno.h>
@@ -18,8 +20,6 @@
#include <linux/backing-dev.h>
#include <linux/uio.h>

-#define DEBUG 0
-
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -39,12 +39,6 @@
#include <asm/kmap_types.h>
#include <asm/uaccess.h>

-#if DEBUG > 1
-#define dprintk printk
-#else
-#define dprintk(x...) do { ; } while (0)
-#endif
-
#define AIO_RING_MAGIC 0xa10a10a1
#define AIO_RING_COMPAT_FEATURES 1
#define AIO_RING_INCOMPAT_FEATURES 0
@@ -124,7 +118,7 @@ static int __init aio_setup(void)
kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);

- pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct page));
+ pr_debug("sizeof(struct page) = %zu\n", sizeof(struct page));

return 0;
}
@@ -178,7 +172,7 @@ static int aio_setup_ring(struct kioctx *ctx)
}

info->mmap_size = nr_pages * PAGE_SIZE;
- dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
+ pr_debug("attempting mmap of %lu bytes\n", info->mmap_size);
down_write(&mm->mmap_sem);
info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size,
PROT_READ|PROT_WRITE,
@@ -191,7 +185,7 @@ static int aio_setup_ring(struct kioctx *ctx)
return -EAGAIN;
}

- dprintk("mmap address: 0x%08lx\n", info->mmap_base);
+ pr_debug("mmap address: 0x%08lx\n", info->mmap_base);
info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages,
1, 0, info->ring_pages, NULL);
up_write(&mm->mmap_sem);
@@ -265,7 +259,7 @@ static void __put_ioctx(struct kioctx *ctx)
aio_nr -= nr_events;
spin_unlock(&aio_nr_lock);
}
- pr_debug("__put_ioctx: freeing %p\n", ctx);
+ pr_debug("freeing %p\n", ctx);
call_rcu(&ctx->rcu_head, ctx_rcu_free);
}

@@ -354,7 +348,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
hlist_add_head_rcu(&ctx->list, &mm->ioctx_list);
spin_unlock(&mm->ioctx_lock);

- dprintk("aio: allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
+ pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
ctx, ctx->user_id, mm, ctx->ring_info.nr);
return ctx;

@@ -363,7 +357,7 @@ out_cleanup:
aio_free_ring(ctx);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
- dprintk("aio: error allocating ioctx %d\n", err);
+ pr_debug("error allocating ioctx %d\n", err);
return ERR_PTR(err);
}

@@ -611,8 +605,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
*/
static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
- dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
- req, atomic_long_read(&req->ki_filp->f_count));
+ pr_debug("(%p): f_count=%ld\n",
+ req, atomic_long_read(&req->ki_filp->f_count));

assert_spin_locked(&ctx->ctx_lock);

@@ -722,9 +716,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
event->res = res;
event->res2 = res2;

- dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
- ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
- res, res2);
+ pr_debug("%p[%lu]: %p: %p %Lx %lx %lx\n",
+ ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+ res, res2);

/* after flagging the request as done, we
* must never even look at it again
@@ -780,9 +774,7 @@ static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
int ret = 0;

ring = kmap_atomic(info->ring_pages[0]);
- dprintk("in aio_read_evt h%lu t%lu m%lu\n",
- (unsigned long)ring->head, (unsigned long)ring->tail,
- (unsigned long)ring->nr);
+ pr_debug("h%u t%u m%u\n", ring->head, ring->tail, ring->nr);

if (ring->head == ring->tail)
goto out;
@@ -803,8 +795,7 @@ static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)

out:
kunmap_atomic(ring);
- dprintk("leaving aio_read_evt: %d h%lu t%lu\n", ret,
- (unsigned long)ring->head, (unsigned long)ring->tail);
+ pr_debug("%d h%u t%u\n", ret, ring->head, ring->tail);
return ret;
}

@@ -867,13 +858,13 @@ static int read_events(struct kioctx *ctx,
if (unlikely(ret <= 0))
break;

- dprintk("read event: %Lx %Lx %Lx %Lx\n",
- ent.data, ent.obj, ent.res, ent.res2);
+ pr_debug("%Lx %Lx %Lx %Lx\n",
+ ent.data, ent.obj, ent.res, ent.res2);

/* Could we split the check in two? */
ret = -EFAULT;
if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
- dprintk("aio: lost an event due to EFAULT.\n");
+ pr_debug("lost an event due to EFAULT.\n");
break;
}
ret = 0;
@@ -936,7 +927,7 @@ static int read_events(struct kioctx *ctx,

ret = -EFAULT;
if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
- dprintk("aio: lost an event due to EFAULT.\n");
+ pr_debug("lost an event due to EFAULT.\n");
break;
}

@@ -967,7 +958,7 @@ static void io_destroy(struct kioctx *ioctx)
hlist_del_rcu(&ioctx->list);
spin_unlock(&mm->ioctx_lock);

- dprintk("aio_release(%p)\n", ioctx);
+ pr_debug("(%p)\n", ioctx);
if (likely(!was_dead))
put_ioctx(ioctx); /* twice for the list */

@@ -1260,7 +1251,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
kiocb->ki_retry = aio_fsync;
break;
default:
- dprintk("EINVAL: io_submit: no operation provided\n");
+ pr_debug("EINVAL: no operation provided\n");
ret = -EINVAL;
}

@@ -1280,7 +1271,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,

/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
- pr_debug("EINVAL: io_submit: reserve field set\n");
+ pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}

@@ -1321,7 +1312,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,

ret = put_user(req->ki_key, &user_iocb->aio_key);
if (unlikely(ret)) {
- dprintk("EFAULT: aio_key\n");
+ pr_debug("EFAULT: aio_key\n");
goto out_put_req;
}

@@ -1402,7 +1393,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,

ctx = lookup_ioctx(ctx_id);
if (unlikely(!ctx)) {
- pr_debug("EINVAL: io_submit: invalid context id\n");
+ pr_debug("EINVAL: invalid context id\n");
return -EINVAL;
}

--
1.8.1.3

2013-03-21 16:46:21

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 08/33] aio: move private stuff out of aio.h

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/usb/gadget/inode.c | 1 +
fs/aio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 61 ----------------------------------------------
3 files changed, 62 insertions(+), 61 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index a1aad43..525cee4 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -25,6 +25,7 @@
#include <linux/slab.h>
#include <linux/poll.h>
#include <linux/mmu_context.h>
+#include <linux/aio.h>

#include <linux/device.h>
#include <linux/moduleparam.h>
diff --git a/fs/aio.c b/fs/aio.c
index d291228..b3b61d1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -45,6 +45,67 @@
#define dprintk(x...) do { ; } while (0)
#endif

+#define AIO_RING_MAGIC 0xa10a10a1
+#define AIO_RING_COMPAT_FEATURES 1
+#define AIO_RING_INCOMPAT_FEATURES 0
+struct aio_ring {
+ unsigned id; /* kernel internal index number */
+ unsigned nr; /* number of io_events */
+ unsigned head;
+ unsigned tail;
+
+ unsigned magic;
+ unsigned compat_features;
+ unsigned incompat_features;
+ unsigned header_length; /* size of aio_ring */
+
+
+ struct io_event io_events[0];
+}; /* 128 bytes + ring size */
+
+#define AIO_RING_PAGES 8
+struct aio_ring_info {
+ unsigned long mmap_base;
+ unsigned long mmap_size;
+
+ struct page **ring_pages;
+ spinlock_t ring_lock;
+ long nr_pages;
+
+ unsigned nr, tail;
+
+ struct page *internal_pages[AIO_RING_PAGES];
+};
+
+static inline unsigned aio_ring_avail(struct aio_ring_info *info,
+ struct aio_ring *ring)
+{
+ return (ring->head + info->nr - 1 - ring->tail) % info->nr;
+}
+
+struct kioctx {
+ atomic_t users;
+ int dead;
+
+ /* This needs improving */
+ unsigned long user_id;
+ struct hlist_node list;
+
+ wait_queue_head_t wait;
+
+ spinlock_t ctx_lock;
+
+ int reqs_active;
+ struct list_head active_reqs; /* used for cancellation */
+
+ /* sys_io_setup currently limits this to an unsigned int */
+ unsigned max_reqs;
+
+ struct aio_ring_info ring_info;
+
+ struct rcu_head rcu_head;
+};
+
/*------ sysctl variables----*/
static DEFINE_SPINLOCK(aio_nr_lock);
unsigned long aio_nr; /* current system wide number of aio requests */
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 615d55a..7b1eb23 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -103,67 +103,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
};
}

-#define AIO_RING_MAGIC 0xa10a10a1
-#define AIO_RING_COMPAT_FEATURES 1
-#define AIO_RING_INCOMPAT_FEATURES 0
-struct aio_ring {
- unsigned id; /* kernel internal index number */
- unsigned nr; /* number of io_events */
- unsigned head;
- unsigned tail;
-
- unsigned magic;
- unsigned compat_features;
- unsigned incompat_features;
- unsigned header_length; /* size of aio_ring */
-
-
- struct io_event io_events[0];
-}; /* 128 bytes + ring size */
-
-#define AIO_RING_PAGES 8
-struct aio_ring_info {
- unsigned long mmap_base;
- unsigned long mmap_size;
-
- struct page **ring_pages;
- spinlock_t ring_lock;
- long nr_pages;
-
- unsigned nr, tail;
-
- struct page *internal_pages[AIO_RING_PAGES];
-};
-
-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
- struct aio_ring *ring)
-{
- return (ring->head + info->nr - 1 - ring->tail) % info->nr;
-}
-
-struct kioctx {
- atomic_t users;
- int dead;
-
- /* This needs improving */
- unsigned long user_id;
- struct hlist_node list;
-
- wait_queue_head_t wait;
-
- spinlock_t ctx_lock;
-
- int reqs_active;
- struct list_head active_reqs; /* used for cancellation */
-
- /* sys_io_setup currently limits this to an unsigned int */
- unsigned max_reqs;
-
- struct aio_ring_info ring_info;
-
- struct rcu_head rcu_head;
-};
-
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
--
1.8.1.3

2013-03-21 16:46:46

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 04/33] aio: remove retry-based AIO

From: Zach Brown <[email protected]>

This removes the retry-based AIO infrastructure now that nothing in tree
is using it.

We want to remove retry-based AIO because it is fundamentally unsafe. It
retries IO submission from a kernel thread that has only assumed the mm of
the submitting task. All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task. This
design flaw means that nothing of any meaningful complexity can use
retry-based AIO.

This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking around
the unused run list in the submission path.

This has only been compiled.

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Zach Brown <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 348 ++++----------------------------------------------
fs/ocfs2/dlmglue.c | 2 +-
fs/read_write.c | 34 +----
include/linux/aio.h | 22 ----
include/linux/errno.h | 1 -
5 files changed, 29 insertions(+), 378 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 3f941f2..f9a7e6a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -54,11 +54,6 @@ unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio request
static struct kmem_cache *kiocb_cachep;
static struct kmem_cache *kioctx_cachep;

-static struct workqueue_struct *aio_wq;
-
-static void aio_kick_handler(struct work_struct *);
-static void aio_queue_work(struct kioctx *);
-
/* aio_setup
* Creates the slab caches used by the aio routines, panic on
* failure as this is done early during the boot sequence.
@@ -68,9 +63,6 @@ static int __init aio_setup(void)
kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);

- aio_wq = alloc_workqueue("aio", 0, 1); /* used to limit concurrency */
- BUG_ON(!aio_wq);
-
pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct page));

return 0;
@@ -86,7 +78,6 @@ static void aio_free_ring(struct kioctx *ctx)
put_page(info->ring_pages[i]);

if (info->mmap_size) {
- BUG_ON(ctx->mm != current->mm);
vm_munmap(info->mmap_base, info->mmap_size);
}

@@ -101,6 +92,7 @@ static int aio_setup_ring(struct kioctx *ctx)
struct aio_ring *ring;
struct aio_ring_info *info = &ctx->ring_info;
unsigned nr_events = ctx->max_reqs;
+ struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;

@@ -126,23 +118,22 @@ static int aio_setup_ring(struct kioctx *ctx)

info->mmap_size = nr_pages * PAGE_SIZE;
dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
- down_write(&ctx->mm->mmap_sem);
+ down_write(&mm->mmap_sem);
info->mmap_base = do_mmap_pgoff(NULL, 0, info->mmap_size,
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE, 0,
&populate);
if (IS_ERR((void *)info->mmap_base)) {
- up_write(&ctx->mm->mmap_sem);
+ up_write(&mm->mmap_sem);
info->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}

dprintk("mmap address: 0x%08lx\n", info->mmap_base);
- info->nr_pages = get_user_pages(current, ctx->mm,
- info->mmap_base, nr_pages,
+ info->nr_pages = get_user_pages(current, mm, info->mmap_base, nr_pages,
1, 0, info->ring_pages, NULL);
- up_write(&ctx->mm->mmap_sem);
+ up_write(&mm->mmap_sem);

if (unlikely(info->nr_pages != nr_pages)) {
aio_free_ring(ctx);
@@ -206,10 +197,7 @@ static void __put_ioctx(struct kioctx *ctx)
unsigned nr_events = ctx->max_reqs;
BUG_ON(ctx->reqs_active);

- cancel_delayed_work_sync(&ctx->wq);
aio_free_ring(ctx);
- mmdrop(ctx->mm);
- ctx->mm = NULL;
if (nr_events) {
spin_lock(&aio_nr_lock);
BUG_ON(aio_nr - nr_events > aio_nr);
@@ -237,7 +225,7 @@ static inline void put_ioctx(struct kioctx *kioctx)
*/
static struct kioctx *ioctx_alloc(unsigned nr_events)
{
- struct mm_struct *mm;
+ struct mm_struct *mm = current->mm;
struct kioctx *ctx;
int err = -ENOMEM;

@@ -256,8 +244,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-ENOMEM);

ctx->max_reqs = nr_events;
- mm = ctx->mm = current->mm;
- atomic_inc(&mm->mm_count);

atomic_set(&ctx->users, 2);
spin_lock_init(&ctx->ctx_lock);
@@ -265,8 +251,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
init_waitqueue_head(&ctx->wait);

INIT_LIST_HEAD(&ctx->active_reqs);
- INIT_LIST_HEAD(&ctx->run_list);
- INIT_DELAYED_WORK(&ctx->wq, aio_kick_handler);

if (aio_setup_ring(ctx) < 0)
goto out_freectx;
@@ -287,14 +271,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
spin_unlock(&mm->ioctx_lock);

dprintk("aio: allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
- ctx, ctx->user_id, current->mm, ctx->ring_info.nr);
+ ctx, ctx->user_id, mm, ctx->ring_info.nr);
return ctx;

out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
out_freectx:
- mmdrop(mm);
kmem_cache_free(kioctx_cachep, ctx);
dprintk("aio: error allocating ioctx %d\n", err);
return ERR_PTR(err);
@@ -391,8 +374,6 @@ void exit_aio(struct mm_struct *mm)
* as indicator that it needs to unmap the area,
* just set it to 0; aio_free_ring() is the only
* place that uses ->mmap_size, so it's safe.
- * That way we get all munmap done to current->mm -
- * all other callers have ctx->mm == current->mm.
*/
ctx->ring_info.mmap_size = 0;
put_ioctx(ctx);
@@ -426,7 +407,6 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
req->ki_dtor = NULL;
req->private = NULL;
req->ki_iovec = NULL;
- INIT_LIST_HEAD(&req->ki_run_list);
req->ki_eventfd = NULL;

return req;
@@ -611,281 +591,6 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}

-/*
- * Queue up a kiocb to be retried. Assumes that the kiocb
- * has already been marked as kicked, and places it on
- * the retry run list for the corresponding ioctx, if it
- * isn't already queued. Returns 1 if it actually queued
- * the kiocb (to tell the caller to activate the work
- * queue to process it), or 0, if it found that it was
- * already queued.
- */
-static inline int __queue_kicked_iocb(struct kiocb *iocb)
-{
- struct kioctx *ctx = iocb->ki_ctx;
-
- assert_spin_locked(&ctx->ctx_lock);
-
- if (list_empty(&iocb->ki_run_list)) {
- list_add_tail(&iocb->ki_run_list,
- &ctx->run_list);
- return 1;
- }
- return 0;
-}
-
-/* aio_run_iocb
- * This is the core aio execution routine. It is
- * invoked both for initial i/o submission and
- * subsequent retries via the aio_kick_handler.
- * Expects to be invoked with iocb->ki_ctx->lock
- * already held. The lock is released and reacquired
- * as needed during processing.
- *
- * Calls the iocb retry method (already setup for the
- * iocb on initial submission) for operation specific
- * handling, but takes care of most of common retry
- * execution details for a given iocb. The retry method
- * needs to be non-blocking as far as possible, to avoid
- * holding up other iocbs waiting to be serviced by the
- * retry kernel thread.
- *
- * The trickier parts in this code have to do with
- * ensuring that only one retry instance is in progress
- * for a given iocb at any time. Providing that guarantee
- * simplifies the coding of individual aio operations as
- * it avoids various potential races.
- */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
-{
- struct kioctx *ctx = iocb->ki_ctx;
- ssize_t (*retry)(struct kiocb *);
- ssize_t ret;
-
- if (!(retry = iocb->ki_retry)) {
- printk("aio_run_iocb: iocb->ki_retry = NULL\n");
- return 0;
- }
-
- /*
- * We don't want the next retry iteration for this
- * operation to start until this one has returned and
- * updated the iocb state. However, wait_queue functions
- * can trigger a kick_iocb from interrupt context in the
- * meantime, indicating that data is available for the next
- * iteration. We want to remember that and enable the
- * next retry iteration _after_ we are through with
- * this one.
- *
- * So, in order to be able to register a "kick", but
- * prevent it from being queued now, we clear the kick
- * flag, but make the kick code *think* that the iocb is
- * still on the run list until we are actually done.
- * When we are done with this iteration, we check if
- * the iocb was kicked in the meantime and if so, queue
- * it up afresh.
- */
-
- kiocbClearKicked(iocb);
-
- /*
- * This is so that aio_complete knows it doesn't need to
- * pull the iocb off the run list (We can't just call
- * INIT_LIST_HEAD because we don't want a kick_iocb to
- * queue this on the run list yet)
- */
- iocb->ki_run_list.next = iocb->ki_run_list.prev = NULL;
- spin_unlock_irq(&ctx->ctx_lock);
-
- /* Quit retrying if the i/o has been cancelled */
- if (kiocbIsCancelled(iocb)) {
- ret = -EINTR;
- aio_complete(iocb, ret, 0);
- /* must not access the iocb after this */
- goto out;
- }
-
- /*
- * Now we are all set to call the retry method in async
- * context.
- */
- ret = retry(iocb);
-
- if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
- /*
- * There's no easy way to restart the syscall since other AIO's
- * may be already running. Just fail this IO with EINTR.
- */
- if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
- ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
- ret = -EINTR;
- aio_complete(iocb, ret, 0);
- }
-out:
- spin_lock_irq(&ctx->ctx_lock);
-
- if (-EIOCBRETRY == ret) {
- /*
- * OK, now that we are done with this iteration
- * and know that there is more left to go,
- * this is where we let go so that a subsequent
- * "kick" can start the next iteration
- */
-
- /* will make __queue_kicked_iocb succeed from here on */
- INIT_LIST_HEAD(&iocb->ki_run_list);
- /* we must queue the next iteration ourselves, if it
- * has already been kicked */
- if (kiocbIsKicked(iocb)) {
- __queue_kicked_iocb(iocb);
-
- /*
- * __queue_kicked_iocb will always return 1 here, because
- * iocb->ki_run_list is empty at this point so it should
- * be safe to unconditionally queue the context into the
- * work queue.
- */
- aio_queue_work(ctx);
- }
- }
- return ret;
-}
-
-/*
- * __aio_run_iocbs:
- * Process all pending retries queued on the ioctx
- * run list.
- * Assumes it is operating within the aio issuer's mm
- * context.
- */
-static int __aio_run_iocbs(struct kioctx *ctx)
-{
- struct kiocb *iocb;
- struct list_head run_list;
-
- assert_spin_locked(&ctx->ctx_lock);
-
- list_replace_init(&ctx->run_list, &run_list);
- while (!list_empty(&run_list)) {
- iocb = list_entry(run_list.next, struct kiocb,
- ki_run_list);
- list_del(&iocb->ki_run_list);
- /*
- * Hold an extra reference while retrying i/o.
- */
- iocb->ki_users++; /* grab extra reference */
- aio_run_iocb(iocb);
- __aio_put_req(ctx, iocb);
- }
- if (!list_empty(&ctx->run_list))
- return 1;
- return 0;
-}
-
-static void aio_queue_work(struct kioctx * ctx)
-{
- unsigned long timeout;
- /*
- * if someone is waiting, get the work started right
- * away, otherwise, use a longer delay
- */
- smp_mb();
- if (waitqueue_active(&ctx->wait))
- timeout = 1;
- else
- timeout = HZ/10;
- queue_delayed_work(aio_wq, &ctx->wq, timeout);
-}
-
-/*
- * aio_run_all_iocbs:
- * Process all pending retries queued on the ioctx
- * run list, and keep running them until the list
- * stays empty.
- * Assumes it is operating within the aio issuer's mm context.
- */
-static inline void aio_run_all_iocbs(struct kioctx *ctx)
-{
- spin_lock_irq(&ctx->ctx_lock);
- while (__aio_run_iocbs(ctx))
- ;
- spin_unlock_irq(&ctx->ctx_lock);
-}
-
-/*
- * aio_kick_handler:
- * Work queue handler triggered to process pending
- * retries on an ioctx. Takes on the aio issuer's
- * mm context before running the iocbs, so that
- * copy_xxx_user operates on the issuer's address
- * space.
- * Run on aiod's context.
- */
-static void aio_kick_handler(struct work_struct *work)
-{
- struct kioctx *ctx = container_of(work, struct kioctx, wq.work);
- mm_segment_t oldfs = get_fs();
- struct mm_struct *mm;
- int requeue;
-
- set_fs(USER_DS);
- use_mm(ctx->mm);
- spin_lock_irq(&ctx->ctx_lock);
- requeue =__aio_run_iocbs(ctx);
- mm = ctx->mm;
- spin_unlock_irq(&ctx->ctx_lock);
- unuse_mm(mm);
- set_fs(oldfs);
- /*
- * we're in a worker thread already; no point using non-zero delay
- */
- if (requeue)
- queue_delayed_work(aio_wq, &ctx->wq, 0);
-}
-
-
-/*
- * Called by kick_iocb to queue the kiocb for retry
- * and if required activate the aio work queue to process
- * it
- */
-static void try_queue_kicked_iocb(struct kiocb *iocb)
-{
- struct kioctx *ctx = iocb->ki_ctx;
- unsigned long flags;
- int run = 0;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- /* set this inside the lock so that we can't race with aio_run_iocb()
- * testing it and putting the iocb on the run list under the lock */
- if (!kiocbTryKick(iocb))
- run = __queue_kicked_iocb(iocb);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- if (run)
- aio_queue_work(ctx);
-}
-
-/*
- * kick_iocb:
- * Called typically from a wait queue callback context
- * to trigger a retry of the iocb.
- * The retry is usually executed by aio workqueue
- * threads (See aio_kick_handler).
- */
-void kick_iocb(struct kiocb *iocb)
-{
- /* sync iocbs are easy: they can only ever be executing from a
- * single context. */
- if (is_sync_kiocb(iocb)) {
- kiocbSetKicked(iocb);
- wake_up_process(iocb->ki_obj.tsk);
- return;
- }
-
- try_queue_kicked_iocb(iocb);
-}
-EXPORT_SYMBOL(kick_iocb);
-
/* aio_complete
* Called when the io request on the given iocb is complete.
* Returns true if this is the last user of the request. The
@@ -926,9 +631,6 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
*/
spin_lock_irqsave(&ctx->ctx_lock, flags);

- if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
- list_del_init(&iocb->ki_run_list);
-
/*
* cancelled requests don't get events, userland was given one
* when the event got cancelled.
@@ -1083,13 +785,11 @@ static int read_events(struct kioctx *ctx,
int i = 0;
struct io_event ent;
struct aio_timeout to;
- int retry = 0;

/* needed to zero any padding within an entry (there shouldn't be
* any, but C is fun!
*/
memset(&ent, 0, sizeof(ent));
-retry:
ret = 0;
while (likely(i < nr)) {
ret = aio_read_evt(ctx, &ent);
@@ -1119,13 +819,6 @@ retry:

/* End fast path */

- /* racey check, but it gets redone */
- if (!retry && unlikely(!list_empty(&ctx->run_list))) {
- retry = 1;
- aio_run_all_iocbs(ctx);
- goto retry;
- }
-
init_timeout(&to);
if (timeout) {
struct timespec ts;
@@ -1345,7 +1038,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
/* If we managed to write some out we return that, rather than
* the eventual error. */
if (opcode == IOCB_CMD_PWRITEV
- && ret < 0 && ret != -EIOCBQUEUED && ret != -EIOCBRETRY
+ && ret < 0 && ret != -EIOCBQUEUED
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;

@@ -1587,18 +1280,27 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
* don't see ctx->dead set here, io_destroy() waits for our IO to
* finish.
*/
- if (ctx->dead) {
- spin_unlock_irq(&ctx->ctx_lock);
+ if (ctx->dead)
ret = -EINVAL;
+ spin_unlock_irq(&ctx->ctx_lock);
+ if (ret)
goto out_put_req;
+
+ if (unlikely(kiocbIsCancelled(req))) {
+ ret = -EINTR;
+ } else {
+ ret = req->ki_retry(req);
}
- aio_run_iocb(req);
- if (!list_empty(&ctx->run_list)) {
- /* drain the run list */
- while (__aio_run_iocbs(ctx))
- ;
+ if (ret != -EIOCBQUEUED) {
+ /*
+ * There's no easy way to restart the syscall since other AIO's
+ * may be already running. Just fail this IO with EINTR.
+ */
+ if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+ ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
+ ret = -EINTR;
+ aio_complete(req, ret, 0);
}
- spin_unlock_irq(&ctx->ctx_lock);

aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 12ae194..3a44a64 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2322,7 +2322,7 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
status = __ocfs2_cluster_lock(osb, lockres, level, dlm_flags,
arg_flags, subclass, _RET_IP_);
if (status < 0) {
- if (status != -EAGAIN && status != -EIOCBRETRY)
+ if (status != -EAGAIN)
mlog_errno(status);
goto bail;
}
diff --git a/fs/read_write.c b/fs/read_write.c
index a698eff..0dabcf7 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -325,16 +325,6 @@ int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count
return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
}

-static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
-{
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (!kiocbIsKicked(iocb))
- schedule();
- else
- kiocbClearKicked(iocb);
- __set_current_state(TASK_RUNNING);
-}
-
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = buf, .iov_len = len };
@@ -346,13 +336,7 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp
kiocb.ki_left = len;
kiocb.ki_nbytes = len;

- for (;;) {
- ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -402,13 +386,7 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof
kiocb.ki_left = len;
kiocb.ki_nbytes = len;

- for (;;) {
- ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -575,13 +553,7 @@ ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,
kiocb.ki_left = len;
kiocb.ki_nbytes = len;

- for (;;) {
- ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
if (ret == -EIOCBQUEUED)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b46a09f..019204e 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -14,18 +14,12 @@ struct kioctx;
#define KIOCB_SYNC_KEY (~0U)

/* ki_flags bits */
-#define KIF_KICKED 1
#define KIF_CANCELLED 2

-#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
-
-#define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)

-#define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)

-#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)

/* is there a better place to document function pointer methods? */
@@ -52,18 +46,8 @@ struct kioctx;
* not ask the method again -- ki_retry must ensure forward progress.
* aio_complete() must be called once and only once in the future, multiple
* calls may result in undefined behaviour.
- *
- * If ki_retry returns -EIOCBRETRY it has made a promise that kick_iocb()
- * will be called on the kiocb pointer in the future. This may happen
- * through generic helpers that associate kiocb->ki_wait with a wait
- * queue head that ki_retry uses via current->io_wait. It can also happen
- * with custom tracking and manual calls to kick_iocb(), though that is
- * discouraged. In either case, kick_iocb() must be called once and only
- * once. ki_retry must ensure forward progress, the AIO core will wait
- * indefinitely for kick_iocb() to be called.
*/
struct kiocb {
- struct list_head ki_run_list;
unsigned long ki_flags;
int ki_users;
unsigned ki_key; /* id of this request */
@@ -160,7 +144,6 @@ static inline unsigned aio_ring_avail(struct aio_ring_info *info,
struct kioctx {
atomic_t users;
int dead;
- struct mm_struct *mm;

/* This needs improving */
unsigned long user_id;
@@ -172,15 +155,12 @@ struct kioctx {

int reqs_active;
struct list_head active_reqs; /* used for cancellation */
- struct list_head run_list; /* used for kicked reqs */

/* sys_io_setup currently limits this to an unsigned int */
unsigned max_reqs;

struct aio_ring_info ring_info;

- struct delayed_work wq;
-
struct rcu_head rcu_head;
};

@@ -188,7 +168,6 @@ struct kioctx {
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
extern int aio_put_req(struct kiocb *iocb);
-extern void kick_iocb(struct kiocb *iocb);
extern int aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
@@ -197,7 +176,6 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-static inline void kick_iocb(struct kiocb *iocb) { }
static inline int aio_complete(struct kiocb *iocb, long res, long res2) { return 0; }
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
diff --git a/include/linux/errno.h b/include/linux/errno.h
index f6bf082..89627b9 100644
--- a/include/linux/errno.h
+++ b/include/linux/errno.h
@@ -28,6 +28,5 @@
#define EBADTYPE 527 /* Type not supported by server */
#define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
#define EIOCBQUEUED 529 /* iocb queued, will get completion event */
-#define EIOCBRETRY 530 /* iocb queued, will trigger a retry */

#endif
--
1.8.1.3

2013-03-21 16:47:14

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 01/33] mm: remove old aio use_mm() comment

From: Zach Brown <[email protected]>

use_mm() is used in more places than just aio. There's no need to mention
callers when describing the function.

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/mmu_context.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..8a8cd02 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -14,9 +14,6 @@
* use_mm
* Makes the calling kernel thread take on the specified
* mm context.
- * Called by the retry thread execute retries within the
- * iocb issuer's mm context, so that copy_from/to_user
- * operations work seamlessly for aio.
* (Note: this routine is intended to be called only
* from a kernel thread context)
*/
--
1.8.1.3

2013-03-21 17:38:33

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 09/33] aio: dprintk() -> pr_debug()

On Thu, 2013-03-21 at 09:35 -0700, Kent Overstreet wrote:
[]
> diff --git a/fs/aio.c b/fs/aio.c

Hi Kent.

I generally prefer pr_debug, but maybe here are
a couple of things you don't already know.

> +#define pr_fmt(fmt) "%s: " fmt, __func__

dynamic debug can add __func__ to each output
with +f, so I think this prefixing with %s, __func__
is unnecessary.

I do think
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
is fairly standard (though that can be added by
dynamic_debug as well with -m).

For example, without dynamic debug enabled,
but with DEBUG #defined and the KBUILD_MODNAME
prefix, I could get in dmesg:
aio: ENOMEM: nr_events too high
instead of
ioctx_alloc: ENOMEM: nr_events too high
and I think that's more intelligible.

These messages are not emitted by default, only
when specifically enabled via dynamic_debug, by
adding #define DEBUG to the sources, or by passing
-DDEBUG to the Makefile.
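
To make the suggestion concrete, something like this at the top of
fs/aio.c (an illustration of the convention, not the patch itself):

	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt	/* before any #include */

	#include <linux/kernel.h>

	...
	pr_debug("ENOMEM: nr_events too high\n");

With -DDEBUG (or the site enabled via dynamic_debug) that shows up in
dmesg as "aio: ENOMEM: nr_events too high", and the function name can
still be added back at runtime with something like:

	echo 'func ioctx_alloc +pf' > /sys/kernel/debug/dynamic_debug/control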

2013-03-28 14:56:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 05/33] char: add aio_{read,write} to /dev/{null,zero}

On Thu, Mar 21, 2013 at 09:35:26AM -0700, Kent Overstreet wrote:
> From: Zach Brown <[email protected]>
>
> These are handy for measuring the cost of the aio infrastructure with
> operations that do very little and complete immediately.
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-28 14:56:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 06/33] aio: kill return value of aio_complete()

On Thu, Mar 21, 2013 at 09:35:27AM -0700, Kent Overstreet wrote:
> Nothing used the return value, and it probably wasn't possible to use it
> safely for the locked versions (aio_complete(), aio_put_req()). Just kill
> it.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Acked-by: Zach Brown <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-28 14:56:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 07/33] aio: add kiocb_cancel()

On Thu, Mar 21, 2013 at 09:35:28AM -0700, Kent Overstreet wrote:
> Minor refactoring, to get rid of some duplicated code
>
> [[email protected]: fix warning]
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-28 14:56:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 01/33] mm: remove old aio use_mm() comment

On Thu, Mar 21, 2013 at 09:35:22AM -0700, Kent Overstreet wrote:
> From: Zach Brown <[email protected]>
>
> use_mm() is used in more places than just aio. There's no need to mention
> callers when describing the function.
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-28 14:57:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 02/33] aio: remove dead code from aio.h

On Thu, Mar 21, 2013 at 09:35:23AM -0700, Kent Overstreet wrote:
> From: Zach Brown <[email protected]>
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-28 14:56:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 04/33] aio: remove retry-based AIO

On Thu, Mar 21, 2013 at 09:35:25AM -0700, Kent Overstreet wrote:
> From: Zach Brown <[email protected]>
>
> This removes the retry-based AIO infrastructure now that nothing in tree
> is using it.
>
> We want to remove retry-based AIO because it is fundamentally unsafe. It
> retries IO submission from a kernel thread that has only assumed the mm of
> the submitting task. All other task_struct references in the IO
> submission path will see the kernel thread, not the submitting task. This
> design flaw means that nothing of any meaningful complexity can use
> retry-based AIO.
>
> This removes all the code and data associated with the retry machinery.
> The most significant benefit of this is the removal of the locking around
> the unused run list in the submission path.
>
> This has only been compiled.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Zach Brown <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-29 18:20:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 11/33] aio: make aio_put_req() lockless

On Thu, Mar 21, 2013 at 09:35:32AM -0700, Kent Overstreet wrote:
> Freeing a kiocb needed to touch the kioctx for three things:
>
> * Pull it off the reqs_active list
> * Decrementing reqs_active
> * Issuing a wakeup, if the kioctx was in the process of being freed.
>
> This patch moves these to aio_complete(), for a couple reasons:
>
> * aio_complete() already has to issue the wakeup, so if we drop the
> kioctx refcount before aio_complete does its wakeup we don't have to
> do it twice.
> * aio_complete currently has to take the kioctx lock, so it makes sense
> for it to pull the kiocb off the reqs_active list too.
> * A later patch is going to change reqs_active to include unreaped
> completions - this will mean allocating a kiocb doesn't have to look
> at the ringbuffer. So taking the decrement of reqs_active out of
> kiocb_free() is useful prep work for that patch.
>
> This doesn't really affect cancellation, since existing (usb) code that
> implements a cancel function still calls aio_complete() - we just have
> to make sure that aio_complete does the necessary teardown for cancelled
> kiocbs.
>
> It does affect code paths where we free kiocbs that were never
> submitted; they need to decrement reqs_active and pull the kiocb off the
> reqs_active list. This occurs in two places: kiocb_batch_free(), which
> is going away in a later patch, and the error path in io_submit_one.
>
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-03-29 18:20:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 10/33] aio: do fget() after aio_get_req()

On Thu, Mar 21, 2013 at 09:35:31AM -0700, Kent Overstreet wrote:
> aio_get_req() will fail if we have the maximum number of requests
> outstanding, which depending on the application may not be uncommon. So
> avoid doing an unnecessary fget().
>
> Signed-off-by: Kent Overstreet <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 01:32:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 12/33] aio: refcounting cleanup

On Thu, Mar 21, 2013 at 09:35:33AM -0700, Kent Overstreet wrote:
> The usage of ctx->dead was fubar - it makes no sense to explicitly check
> it all over the place, especially when we're already using RCU.
>
> Now, ctx->dead only indicates whether we've dropped the initial
> refcount. The new teardown sequence is:
> set ctx->dead
> hlist_del_rcu();
> synchronize_rcu();
>
> Now we know no system calls can take a new ref, and it's safe to drop
> the initial ref:
> put_ioctx();
>
> We also need to ensure there are no more outstanding kiocbs. This was
> done incorrectly - it was being done in kill_ctx(), and before dropping
> the initial refcount. At this point, other syscalls may still be
> submitting kiocbs!
>
> Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
> kioctx->users has dropped to 0 and we know no more iocbs could be
> submitted.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 01:44:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 13/33] wait: add wait_event_hrtimeout()

On Thu, Mar 21, 2013 at 09:35:34AM -0700, Kent Overstreet wrote:
> Analagous to wait_event_timeout() and friends, this adds
> wait_event_hrtimeout() and wait_event_interruptible_hrtimeout().
>
> Note that unlike the versions that use regular timers, these don't return
> the amount of time remaining when they return - instead, they return 0 or
> -ETIME if they timed out, because I was uncomfortable with the semantics
> of doing it the other way (assuming I could even get it right, anyway).
>
> If the timer expires, there's no real guarantee that expire_time -
> current_time would be <= 0 - due to timer slack certainly, and I'm not
> sure I want to know the implications of the different clock bases in
> hrtimers.
>
> If the timer does expire and the code calculates that the time remaining
> is nonnegative, that could be even worse if the calling code then reuses
> that timeout. Probably safer to just return 0 then, but I could imagine
> weird bugs or at least unintended behaviour arising from that too.
>
> I came to the conclusion that if other users end up actually needing the
> amount of time remaining, the sanest thing to do would be to create a
> version that uses absolute timeouts instead of relative.

Reviewed-by: "Theodore Ts'o" <[email protected]>
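
As a usage sketch of the semantics described above (struct my_ctx and
events_pending() are made-up names, not from the patch):

	static long example_wait(struct my_ctx *ctx, u64 timeout_ns)
	{
		long ret;

		ret = wait_event_interruptible_hrtimeout(ctx->wait,
						events_pending(ctx),
						ns_to_ktime(timeout_ns));
		if (ret == -ETIME)
			return 0;	/* timed out; no "time remaining" is reported */

		return ret;		/* 0 if the condition became true,
					 * or -ERESTARTSYS if interrupted */
	}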

2013-04-02 01:58:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 14/33] aio: make aio_read_evt() more efficient, convert to hrtimers

On Thu, Mar 21, 2013 at 09:35:35AM -0700, Kent Overstreet wrote:
> Previously, aio_read_event() pulled a single completion off the ringbuffer
> at a time, locking and unlocking each time. Change it to pull off as many
> events as it can at a time, and copy them directly to userspace.
>
> This also fixes a bug where if copying the event to userspace failed,
> we'd lose the event.
>
> Also convert it to wait_event_interruptible_hrtimeout(), which
> simplifies it quite a bit.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 02:12:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 15/33] aio: use flush_dcache_page()

The commit description needs to explain why flush_dcache_page() is
needed now, but wasn't needed before.

- Ted

2013-04-02 02:36:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 16/33] aio: use cancellation list lazily

On Thu, Mar 21, 2013 at 09:35:37AM -0700, Kent Overstreet wrote:
> Cancelling kiocbs requires adding them to a per kioctx linked list, which
> is one of the few things we need to take the kioctx lock for in the fast
> path. But most kiocbs can't be cancelled - so if we just do this lazily,
> we can avoid quite a bit of locking overhead.
>
> While we're at it, instead of using a flag bit switch to using ki_cancel
> itself to indicate that a kiocb has been cancelled/completed. This lets
> us get rid of ki_flags entirely.


Reviewed-by: "Theodore Ts'o" <[email protected]>

One nit....

> + * And since most things don't implement kiocb cancellation and we'd really like
> + * kiocb completion to be lockless when possible, we use ki_cancel to
> + * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
> + * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().

It's not batch_complete_aio() until later in the patch series; as
of this commit, it's still aio_complete().

- Ted
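
A self-contained model of the xchg() trick that comment describes, using
C11 atomics in place of the kernel's xchg()/cmpxchg() and made-up names:

	#include <stdatomic.h>
	#include <stdint.h>

	#define KIOCB_CANCELLED	UINTPTR_MAX	/* sentinel stored in ki_cancel */

	struct kiocb_model {
		/* 0 (no cancel fn), a cancel callback, or the sentinel */
		atomic_uintptr_t ki_cancel;
	};

	/*
	 * Both the cancel path and the completion path call this; whichever
	 * runs second gets the sentinel back and knows the other side has
	 * already acted, so cancellation and completion can't race.
	 */
	static uintptr_t claim_ki_cancel(struct kiocb_model *req)
	{
		return atomic_exchange(&req->ki_cancel, KIOCB_CANCELLED);
	}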

2013-04-02 02:54:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 17/33] aio: change reqs_active to include unreaped completions

On Thu, Mar 21, 2013 at 09:35:38AM -0700, Kent Overstreet wrote:
> The aio code tries really hard to avoid having to deal with the completion
> ringbuffer overflowing. To do that, it has to keep track of the number of
> outstanding kiocbs, and the number of completions currently in the
> ringbuffer - and it's got to check that every time we allocate a kiocb.
> Ouch.
>
> But - we can improve this quite a bit if we just change reqs_active to
> mean "number of outstanding requests and unreaped completions" - that
> means kiocb allocation doesn't have to look at the ringbuffer, which is a
> fairly significant win.

Signed-off-by: "Theodore Ts'o" <[email protected]>

Could you please add a quick comment documenting the reqs_active field
in the struct kioctx definition here? For future code
maintainability, it should be documented in fs/aio.c, not just in a
commit description.

> struct kioctx {
> atomic_t users;
> atomic_t dead;
> @@ -92,7 +86,13 @@ struct kioctx {
> atomic_t reqs_active;
> struct list_head active_reqs; /* used for cancellation */

- Ted

2013-04-02 03:03:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 18/33] aio: kill batch allocation

On Thu, Mar 21, 2013 at 09:35:39AM -0700, Kent Overstreet wrote:
> Previously, allocating a kiocb required touching quite a few global (well,
> per kioctx) cachelines... so batching up allocation to amortize those was
> worthwhile. But we've gotten rid of some of those, and in another couple
> of patches kiocb allocation won't require writing to any shared
> cachelines, so that means we can just rip this code out.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 03:27:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 19/33] aio: kill struct aio_ring_info

On Thu, Mar 21, 2013 at 09:35:40AM -0700, Kent Overstreet wrote:
> struct aio_ring_info was kind of odd, the only place it's used is where
> it's embedded in struct kioctx - there's no real need for it.
>
> The next patch rearranges struct kioctx and puts various things on their
> own cachelines - getting rid of struct aio_ring_info now makes that
> reordering a bit clearer.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 15:47:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 17/33] aio: change reqs_active to include unreaped completions

On Mon, Apr 01, 2013 at 10:53:50PM -0400, Theodore Ts'o wrote:
> Could you please add a quick comment documenting the reqs_active field
> in the struct kioctx definition here? For future code
> maintainability, it should be documented in fs/aio.c, not just in a
> commit description.

I see this field gets changed yet again in "aio: reqs_active ->
reqs_available" and it's documented appropritaely after that commit....

- Ted

2013-04-02 15:48:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 21/33] aio: reqs_active -> reqs_available

On Thu, Mar 21, 2013 at 09:35:42AM -0700, Kent Overstreet wrote:
> The number of outstanding kiocbs is one of the few shared things left that
> has to be touched for every kiocb - it'd be nice to make it percpu.
>
> We can make it per cpu by treating it like an allocation problem: we have
> a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
> just allocate and free slots, and we know how to write per cpu allocators.
>
> So as prep work for that, we convert reqs_active to reqs_available.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 16:03:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 22/33] aio: percpu reqs_available

On Thu, Mar 21, 2013 at 09:35:43AM -0700, Kent Overstreet wrote:
> See the previous patch ("aio: reqs_active -> reqs_available") for why we
> want to do this - this basically implements a per cpu allocator for
> reqs_available that doesn't actually allocate anything.
>
> Note that we need to increase the size of the ringbuffer we allocate,
> since a single thread won't necessarily be able to use all the
> reqs_available slots - some (up to about half) might be on other per cpu
> lists, unavailable for the current thread.
>
> We size the ringbuffer based on the nr_events userspace passed to
> io_setup(), so this is a slight behaviour change - but nr_events wasn't
> being used as a hard limit before, it was being rounded up to the next
> page before so this doesn't change the actual semantics.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 16:27:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 23/33] generic dynamic per cpu refcounting

Reviewed-by: "Theodore Ts'o" <[email protected]>

> + v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS),
> + &ref->count);
> +
> + if (!(v >> PCPU_COUNT_BITS) &&
> + REF_STATUS(pcpu_count) == PCPU_REF_NONE && alloc)
> + percpu_ref_alloc(ref, pcpu_count);

This assumes that the kernel is compiled with -fno-strict-overflow.
Which we do, and this is not the only place in the kernel where we
depend on this, so while I was nervous before, I'm okay with it now.
Could we at least have a comment saying that we're depending on
-fno-strict-overflow, though?

2013-04-02 16:29:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 24/33] aio: percpu ioctx refcount

On Thu, Mar 21, 2013 at 09:35:45AM -0700, Kent Overstreet wrote:
> This just converts the ioctx refcount to the new generic dynamic percpu
> refcount code.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 16:35:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 25/33] aio: use xchg() instead of completion_lock

On Thu, Mar 21, 2013 at 09:35:46AM -0700, Kent Overstreet wrote:
> So, for sticking kiocb completions on the kioctx ringbuffer, we need a
> lock - it unfortunately can't be lockless.
>
> When the kioctx is shared between threads on different cpus and the rate
> of completions is high, this lock sees quite a bit of contention - in
> terms of cacheline contention it's the hottest thing in the aio subsystem.
>
> That means, with a regular spinlock, we're going to take a cache miss to
> grab the lock, then another cache miss when we touch the data the lock
> protects - if it's on the same cacheline as the lock, other cpus spinning
> on the lock are going to be pulling it out from under us as we're using
> it.
>
> So, we use an old trick to get rid of this second forced cache miss - make
> the data the lock protects be the lock itself, so we grab them both at
> once.

Reviewed-by: "Theodore Ts'o" <[email protected]>
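
To illustrate the trick in isolation (the patch presumably does this with
the kernel's xchg()/cmpxchg() on the kioctx tail; this standalone sketch
uses C11 atomics and invented names):

	#include <limits.h>
	#include <stdatomic.h>

	/*
	 * The tail index doubles as the lock: a sentinel value means
	 * "locked", so the exchange that takes the lock also fetches the
	 * tail it protects, and the store that publishes the new tail
	 * also drops the lock - one cache miss instead of two.
	 */
	#define TAIL_LOCKED	UINT_MAX

	static _Atomic unsigned ring_tail;

	static unsigned ring_lock(void)
	{
		unsigned tail;

		/* spin until we swap in the sentinel and get a real tail back */
		while ((tail = atomic_exchange(&ring_tail, TAIL_LOCKED)) == TAIL_LOCKED)
			;
		return tail;
	}

	static void ring_unlock(unsigned new_tail)
	{
		atomic_store(&ring_tail, new_tail);
	}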

2013-04-02 16:35:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 26/33] aio: don't include aio.h in sched.h

On Thu, Mar 21, 2013 at 09:35:47AM -0700, Kent Overstreet wrote:
> Faster kernel compiles by way of fewer unnecessary includes.

Signed-off-by: "Theodore Ts'o" <[email protected]>

2013-04-02 16:36:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 27/33] aio: kill ki_key

On Thu, Mar 21, 2013 at 09:35:48AM -0700, Kent Overstreet wrote:
> ki_key wasn't actually used for anything previously - it was always 0.
> Drop it to trim struct kiocb a bit.

Signed-off-by: "Theodore Ts'o" <[email protected]>

2013-04-02 18:47:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 28/33] aio: kill ki_retry

On Thu, Mar 21, 2013 at 09:35:49AM -0700, Kent Overstreet wrote:
> Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
> need this anymore - ki_retry was only called right after the kiocb was
> initialized.
>
> This also refactors and trims some duplicated code, as well as cleaning up
> the refcounting/error handling a bit.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 18:48:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 29/33] block: Prep work for batch completion

On Thu, Mar 21, 2013 at 09:35:50AM -0700, Kent Overstreet wrote:
> Add a struct batch_complete * argument to bi_end_io; infrastructure to
> make use of it comes in the next patch.

Reviewed-by: "Theodore Ts'o" <[email protected]>

2013-04-02 19:49:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 30/33] block, aio: batch completion for bios/kiocbs

On Thu, Mar 21, 2013 at 09:35:51AM -0700, Kent Overstreet wrote:
> + if (unlikely(req->ki_ctx != ctx)) {
> + kioctx_ring_unlock(ctx, tail);
> +
> + ctx = req->ki_ctx;
> + tail = kioctx_ring_lock(ctx);
> + }

The only place where you're calling kioctx_ring_lock() is above, which
is part of an unlock/lock pair.

There is also a kioctx_ring_unlock at the end of batch_complete_aio():

> + kioctx_ring_unlock(ctx, tail);
> + local_irq_restore(flags);
> + rcu_read_unlock();

But I'm not seeing a matching kioctx_ring_lock() before the while loop
in batch_complete_aio(), nor anywhere else in the file. And since
kioctx_ring_lock() is a static function....

Am I missing something?

- Ted

2013-04-02 19:53:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 30/33] block, aio: batch completion for bios/kiocbs

On Thu, Mar 21, 2013 at 09:35:51AM -0700, Kent Overstreet wrote:
> + if (unlikely(req->ki_eventfd != eventfd)) {
> + if (eventfd) {
> + /* Make event visible */
> + kioctx_ring_unlock(ctx, tail);
> + ctx = NULL;
> +
> + eventfd_signal(eventfd, 1);
> + eventfd_ctx_put(eventfd);
> + }

I just noticed something else. There's a ring unlock here... but
there isn't a matching ring_lock(), or an exit from the function.
Since you've set the ctx to NULL, aren't we going to crash
at the subsequent kioctx_ring_unlock() below....

> +
> + eventfd = req->ki_eventfd;
> + req->ki_eventfd = NULL;
> + }
> +
> + if (unlikely(req->ki_ctx != ctx)) {
> + kioctx_ring_unlock(ctx, tail);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Or the kioctx_ring_unlock() at the end of this function after the
while loop terminates.)

- Ted

2013-04-02 21:35:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 33/33] aio: fix kioctx not being freed after cancellation at exit time

On Thu, Mar 21, 2013 at 09:35:54AM -0700, Kent Overstreet wrote:
> From: Benjamin LaHaise <[email protected]>
>
> The recent changes overhauling fs/aio.c introduced a bug that results in the
> kioctx not being freed when outstanding kiocbs are cancelled at exit_aio()
> time. Specifically, a kiocb that is cancelled has its completion events
> discarded by batch_complete_aio(), which then fails to wake up the process
> stuck in free_ioctx(). Fix this by removing the event suppression in
> batch_complete_aio() and modify the wait_event() condition in free_ioctx()
> appropriately.

Once you remove the event suppression, then it means that every single
cancelled AIO will result in ki_ctx->reqs_available getting double
incremented, right? But reqs_available gets used in more places than
just free_ioctx(). It also gets used (for example) by
get_reqs_available(), which in turn gets used by aio_get_req() to
decide whether or not it's safe to allocate another aio_request.
Since reqs_available is getting double allocated, won't we end up
allowing more AIO requests to be issued --- more than we would have
room in the ring?

Am I missing something?

- Ted

2013-04-09 21:08:03

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 15/33] aio: use flush_dcache_page()

On Mon, Apr 01, 2013 at 10:12:29PM -0400, Theodore Ts'o wrote:
> The commit description needs to explain why flush_dcache_page() is
> needed now, but wasn't needed before.

It wasn't causing problems before because it's not needed on x86, but it
is needed on other architectures. Added that to the commit description.

2013-04-09 21:15:16

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 33/33] aio: fix kioctx not being freed after cancellation at exit time

On Tue, Apr 02, 2013 at 05:35:50PM -0400, Theodore Ts'o wrote:
> On Thu, Mar 21, 2013 at 09:35:54AM -0700, Kent Overstreet wrote:
> > From: Benjamin LaHaise <[email protected]>
> >
> > The recent changes overhauling fs/aio.c introduced a bug that results in the
> > kioctx not being freed when outstanding kiocbs are cancelled at exit_aio()
> > time. Specifically, a kiocb that is cancelled has its completion events
> > discarded by batch_complete_aio(), which then fails to wake up the process
> > stuck in free_ioctx(). Fix this by removing the event suppression in
> > batch_complete_aio() and modify the wait_event() condition in free_ioctx()
> > appropriately.
>
> Once you remove the event suppression, then it means that every single
> cancelled AIO will result in ki_ctx->reqs_available getting double
> incremented, right?

I'm not sure where you're seeing the double increment...

Previously, when we were suppressing the events, we needed to increment
reqs_available to account for the fact that we wouldn't be doing a
put_reqs_available() when reaping the io_event.

I think the commit description could've been a bit better - this patch
is changing the behaviour of cancellation, and it makes more sense in
context with some of the other cancellation patches - instead of
returning the io_event via io_cancel(), we're returning it via
io_getevents() as it would be normally.

So all removing the event suppression is doing is causing the io_events
from cancelled kiocbs to be handled just like any other io_event.

> But reqs_available gets used in more places than
> just free_ioctx(). It also gets used (for example) by
> get_reqs_available(), which in turn gets used by aio_get_req() to
> decide whether or not it's safe to allocate another aio_request.
> Since reqs_available is getting double allocated, won't we end up
> allowing more AIO requests to be issued --- more than we would have
> room in the ring?
>
> Am I missing something?

You're right about how reqs_available is used, but unless I'm missing
something the accounting is correct. Maybe we should go over it
together?
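
A toy model of the accounting in question (not kernel code; the names
just mirror the discussion):

	#include <stdatomic.h>

	/* reqs_available counts free completion slots: submission takes
	 * one, reaping the io_event gives it back.  With the event
	 * suppression removed, a cancelled kiocb's event is reaped like
	 * any other, so each kiocb still returns exactly one slot -
	 * there's no path that increments twice. */
	static atomic_int reqs_available = 128;

	static void submit_one(void)     { atomic_fetch_sub(&reqs_available, 1); }
	static void reap_one_event(void) { atomic_fetch_add(&reqs_available, 1); }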

2013-04-10 21:59:20

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 30/33] block, aio: batch completion for bios/kiocbs

On Tue, Apr 02, 2013 at 03:48:03PM -0400, Theodore Ts'o wrote:
> On Thu, Mar 21, 2013 at 09:35:51AM -0700, Kent Overstreet wrote:
> > + if (unlikely(req->ki_ctx != ctx)) {
> > + kioctx_ring_unlock(ctx, tail);
> > +
> > + ctx = req->ki_ctx;
> > + tail = kioctx_ring_lock(ctx);
> > + }
>
> The only place where you're calling kioctx_ring_lock() is above, which
> is part of an unlock/lock pair.
>
> There is also a kioctx_ring_unlock at the end of batch_complete_aio():
>
> > + kioctx_ring_unlock(ctx, tail);
> > + local_irq_restore(flags);
> > + rcu_read_unlock();
>
> But I'm not seeing a matching kioctx_ring_lock() before the while loop
> in batch_complete_aio(), nor anywhere else in the file. And since
> kioctx_ring_lock() is a static function....
>
> Am I missing something?

We start out with ctx == NULL - we handle the initial kiocb the same way
we handle a kiocb whose kioctx differs from the previous one.

2013-04-10 22:09:31

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 30/33] block, aio: batch completion for bios/kiocbs

On Tue, Apr 02, 2013 at 03:53:03PM -0400, Theodore Ts'o wrote:
> On Thu, Mar 21, 2013 at 09:35:51AM -0700, Kent Overstreet wrote:
> > + if (unlikely(req->ki_eventfd != eventfd)) {
> > + if (eventfd) {
> > + /* Make event visible */
> > + kioctx_ring_unlock(ctx, tail);
> > + ctx = NULL;
> > +
> > + eventfd_signal(eventfd, 1);
> > + eventfd_ctx_put(eventfd);
> > + }
>
> I just noticed something else. There's a ring unlock here().... but
> there isn't a matching ring_lock(), or an exit from the function.
> Since you've set the ctx to be NULL, then subsequently, aren't we
> going to crash at the subseqent kioctx_ring_unlock() below....

No, kioctx_ring_unlock() checks for ctx == NULL - it would be more
readable I suppose to have the check outside of kioctx_ring_unlock() but
that's how it ended up... the check is needed in multiple places.

>
> > +
> > + eventfd = req->ki_eventfd;
> > + req->ki_eventfd = NULL;
> > + }
> > +
> > + if (unlikely(req->ki_ctx != ctx)) {
> > + kioctx_ring_unlock(ctx, tail);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> (Or the kioctx_ring_unlock() at the end of this function after the
> while loop terminates.)
>
> - Ted
>
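
In other words, the loop being discussed has roughly this shape (the
list iteration and field names here are guesses; the inner branch is
taken from the quoted patch):

	struct kioctx *ctx = NULL;
	unsigned tail = 0;

	list_for_each_entry(req, &batch->kiocbs, ki_list) {
		if (req->ki_ctx != ctx) {
			kioctx_ring_unlock(ctx, tail);	/* no-op while ctx == NULL */
			ctx = req->ki_ctx;
			tail = kioctx_ring_lock(ctx);
		}
		/* ... append req's completion to ctx's ring at tail ... */
	}

	kioctx_ring_unlock(ctx, tail);	/* also tolerates a never-locked NULL ctx */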

2013-04-12 15:42:37

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 01/33] mm: remove old aio use_mm() comment

Kent Overstreet <[email protected]> writes:

> From: Zach Brown <[email protected]>
>
> use_mm() is used in more places than just aio. There's no need to mention
> callers when describing the function.
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:42:57

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 02/33] aio: remove dead code from aio.h

Kent Overstreet <[email protected]> writes:

> From: Zach Brown <[email protected]>
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:43:17

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 03/33] gadget: remove only user of aio retry

Kent Overstreet <[email protected]> writes:

> From: Zach Brown <[email protected]>
>
> This removes the only in-tree user of aio retry. This will let us remove
> the retry code from the aio core.
>
> Removing retry is relatively easy as the USB gadget wasn't using it to
> retry IOs at all. It always fully submitted the IO in the context of the
> initial io_submit() call. It only used the AIO retry facility to get the
> submitter's mm context for copying the result of a read back to user
> space. This is easy to implement with use_mm() and a work struct, much
> like kvm does with async_pf_execute() for get_user_pages().
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:43:33

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 04/33] aio: remove retry-based AIO

Kent Overstreet <[email protected]> writes:

> From: Zach Brown <[email protected]>
>
> This removes the retry-based AIO infrastructure now that nothing in tree
> is using it.
>
> We want to remove retry-based AIO because it is fundamentally unsafe. It
> retries IO submission from a kernel thread that has only assumed the mm of
> the submitting task. All other task_struct references in the IO
> submission path will see the kernel thread, not the submitting task. This
> design flaw means that nothing of any meaningful complexity can use
> retry-based AIO.
>
> This removes all the code and data associated with the retry machinery.
> The most significant benefit of this is the removal of the locking around
> the unused run list in the submission path.
>
> This has only been compiled.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Zach Brown <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:44:07

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 05/33] char: add aio_{read,write} to /dev/{null,zero}

Kent Overstreet <[email protected]> writes:

> From: Zach Brown <[email protected]>
>
> These are handy for measuring the cost of the aio infrastructure with
> operations that do very little and complete immediately.
>
> Signed-off-by: Zach Brown <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
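
As an illustration of the kind of measurement this enables, a minimal
userspace harness against /dev/zero could look like the sketch below. It uses
the libaio wrappers and is only a plausible example, not a benchmark posted
with the series (build with something like gcc -O2 bench.c -laio -lrt):

/* Time no-op AIO reads from /dev/zero: submit one read, reap it,
 * repeat, and report the average cost per submit+reap. */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char buf[4096];
	struct timespec t0, t1;
	int fd = open("/dev/zero", O_RDONLY);
	long i, nr = 100000;

	if (fd < 0 || io_setup(128, &ctx))
		exit(1);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nr; i++) {
		io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
		if (io_submit(ctx, 1, cbs) != 1)
			exit(1);
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.0f ns per submit+reap\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / nr);

	io_destroy(ctx);
	return 0;
}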

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:44:30

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 06/33] aio: kill return value of aio_complete()

Kent Overstreet <[email protected]> writes:

> Nothing used the return value, and it probably wasn't possible to use it
> safely for the locked versions (aio_complete(), aio_put_req()). Just kill
> it.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Acked-by: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:58:49

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 07/33] aio: add kiocb_cancel()

Kent Overstreet <[email protected]> writes:

> Minor refactoring, to get rid of some duplicated code
>
> [[email protected]: fix warning]
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

The patch appears to preserve existing behaviour. However, the man page
and the code disagree about the return code when there is no cancellation
routine (or, more precisely, when the iocb could not be canceled):

ERRORS
EAGAIN The iocb specified was not canceled.

EFAULT One of the data structures points to invalid data.

EINVAL The AIO context specified by ctx_id is invalid.

ENOSYS io_cancel() is not implemented on this architecture.

The code (before and after the patch) returns EINVAL when the iocb was
not canceled. Should we fix the code or the docs, here?
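
In the meantime, userspace pretty much has to treat both values as "the iocb
was not canceled". A minimal hedged sketch using the libaio wrapper
(illustrative only, not code from the patch):

/* io_cancel() in libaio returns 0 on success and a negative errno on
 * failure; handle both the documented (EAGAIN) and the observed
 * (EINVAL) "could not cancel" returns. */
#include <libaio.h>
#include <errno.h>

static int try_cancel(io_context_t ctx, struct iocb *cb)
{
	struct io_event ev;
	int ret = io_cancel(ctx, cb, &ev);

	if (ret == 0)
		return 0;		/* canceled; result is in ev */
	if (ret == -EAGAIN || ret == -EINVAL)
		return 1;		/* not canceled; reap it normally */
	return ret;			/* real error, e.g. -EFAULT */
}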

For the patch:

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 15:59:22

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 08/33] aio: move private stuff out of aio.h

Kent Overstreet <[email protected]> writes:

> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 16:02:20

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 09/33] aio: dprintk() -> pr_debug()

Kent Overstreet <[email protected]> writes:

> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Felipe Balbi <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Mark Fasheh <[email protected]>
> Cc: Joel Becker <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Asai Thambi S P <[email protected]>
> Cc: Selvan Mani <[email protected]>
> Cc: Sam Bradshaw <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Benjamin LaHaise <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
> fs/aio.c | 57 ++++++++++++++++++++++++---------------------------------
> 1 file changed, 24 insertions(+), 33 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index b3b61d1..2637555 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -8,6 +8,8 @@
> *
> * See ../COPYING for licensing terms.
> */
> +#define pr_fmt(fmt) "%s: " fmt, __func__
> +

Curious, this isn't used anywhere.
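
(pr_fmt() is consumed by the pr_*() macros themselves rather than being
referenced by name in aio.c; a simplified sketch of the usual printk.h
definitions, not the exact header text:)

/* Simplified sketch of how the pr_*() helpers pick up pr_fmt(); the
 * real definitions live in include/linux/printk.h. */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

#define pr_debug(fmt, ...) \
	printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)

/* With the override at the top of aio.c,
 *	pr_debug("ctx %p\n", ctx);
 * expands to roughly
 *	printk(KERN_DEBUG "%s: ctx %p\n", __func__, ctx);
 */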

Other than that:

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 16:51:52

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 10/33] aio: do fget() after aio_get_req()

Kent Overstreet <[email protected]> writes:

> aio_get_req() will fail if we have the maximum number of requests
> outstanding, which, depending on the application, may not be uncommon. So
> avoid doing an unnecessary fget().


> diff --git a/fs/aio.c b/fs/aio.c
> index 2637555..4f23d43 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -587,6 +587,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
> {
> assert_spin_locked(&ctx->ctx_lock);
>
> + if (req->ki_filp)
> + fput(req->ki_filp);
> if (req->ki_eventfd != NULL)
> eventfd_ctx_put(req->ki_eventfd);
> if (req->ki_dtor)
[snip]
> @@ -618,8 +617,6 @@ static void __aio_put_req(struct kioctx *ctx, struct kiocb *req)
> req->ki_cancel = NULL;
> req->ki_retry = NULL;
>
> - fput(req->ki_filp);
> - req->ki_filp = NULL;
> really_put_req(ctx, req);
> }

So you've removed the setting of req->ki_filp to NULL here, and I think
it's okay. The only function called after that which could possibly be
tripped up is req->ki_dtor. That function has no business looking at
ki_filp, I think (and the only in-tree user does not look at it).

Acked-by: Jeff Moyer <[email protected]>

2013-04-12 19:36:07

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 23/33] generic dynamic per cpu refcounting

On Tue, Apr 02, 2013 at 12:27:38PM -0400, Theodore Ts'o wrote:
> Reviewed-by: "Theodore Ts'o" <[email protected]>
>
> > + v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS),
> > + &ref->count);
> > +
> > + if (!(v >> PCPU_COUNT_BITS) &&
> > + REF_STATUS(pcpu_count) == PCPU_REF_NONE && alloc)
> > + percpu_ref_alloc(ref, pcpu_count);
>
> This assumes that the kernel is compiled with -fno-strict-overflow.
> Which we do, and this is not the only place int the kernel where we
> depend on this, so while I was nervous before, I'm okay with it now.
> Could we at least have a comment saying that we're depending on
> -fno-strict-overflow, though?

Well, I don't think it is true that we are depending on
-fno-strict-overflow, since the overflow happens in atomic_add(), which is
a black box to the compiler.

It would be nice if we had unsigned atomic types... but given that we
don't, and that overflow in atomic types happens all over the place,
that part honestly seems fine to me...

That said, I suppose a comment indicating that it is intentionally
overflowing is probably merited. Ted, Andrew, is this acceptable to you?

---
lib/percpu-refcount.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 79c6158..200088f 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -124,6 +124,13 @@ void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS),
&ref->count);

+ /*
+ * The high bits of the counter count the number of gets() that
> + * have occurred; we check for overflow to call
+ * percpu_ref_alloc() every (1 << (64 - PCPU_COUNT_BITS))
+ * iterations.
+ */
+
if (!(v >> PCPU_COUNT_BITS) &&
REF_STATUS(pcpu_count) == PCPU_REF_NONE && alloc)
percpu_ref_alloc(ref, pcpu_count);
--
1.7.12.146.g16d26b1
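
To spell the trick out: the low bits of the same 64-bit word hold the
refcount, the bits above them count gets, and the high field wrapping back to
zero is the periodic trigger. Below is a standalone userspace sketch with a
deliberately tiny counter width so the wrap is visible; constants and names
are illustrative only, not the kernel code:

/* Low COUNT_BITS bits hold the refcount, the bits above count gets;
 * when the upper field wraps to zero, fire the periodic action.
 * A 16-bit word with 4 high bits triggers every 16 gets. */
#include <stdint.h>
#include <stdio.h>

#define COUNT_BITS	12
typedef uint16_t word_t;

int main(void)
{
	word_t count = 0;
	int i;

	for (i = 1; i <= 48; i++) {
		/* one "get": bump refcount and get-counter in a single
		 * add, mirroring atomic64_add_return(1 + (1 << BITS)) */
		word_t v = count += 1 + ((word_t)1 << COUNT_BITS);

		if (!(v >> COUNT_BITS))
			printf("get %2d: high bits wrapped, would call "
			       "percpu_ref_alloc()\n", i);
	}
	return 0;
}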

2013-04-12 21:02:20

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 11/33] aio: make aio_put_req() lockless

Kent Overstreet <[email protected]> writes:

> Freeing a kiocb needed to touch the kioctx for three things:
>
> * Pull it off the reqs_active list
> * Decrementing reqs_active
> * Issuing a wakeup, if the kioctx was in the process of being freed.
>
> This patch moves these to aio_complete(), for a couple reasons:
>
> * aio_complete() already has to issue the wakeup, so if we drop the
> kioctx refcount before aio_complete does its wakeup we don't have to
> do it twice.
> * aio_complete currently has to take the kioctx lock, so it makes sense
> for it to pull the kiocb off the reqs_active list too.
> * A later patch is going to change reqs_active to include unreaped
> completions - this will mean allocating a kiocb doesn't have to look
> at the ringbuffer. So taking the decrement of reqs_active out of
> kiocb_free() is useful prep work for that patch.
>
> This doesn't really affect cancellation, since existing (usb) code that
> implements a cancel function still calls aio_complete() - we just have
> to make sure that aio_complete does the necessary teardown for cancelled
> kiocbs.
>
> It does affect code paths where we free kiocbs that were never
> submitted; they need to decrement reqs_active and pull the kiocb off the
> reqs_active list. This occurs in two places: kiocb_batch_free(), which
> is going away in a later patch, and the error path in io_submit_one.

After reading the patch description and the patch, I'm left wondering
whether you did this as a cleanup or a performance patch.

Anyway, I don't see any issue with it.

Acked-by: Jeff Moyer <[email protected]>

2013-04-16 01:41:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 23/33] generic dynamic per cpu refcounting

On Fri, Apr 12, 2013 at 12:36:00PM -0700, Kent Overstreet wrote:
> It would be nice if we had unsigned atomic types... but given that we
> don't and I'm pretty sure overflow in atomic types happens all over the
> place that part honestly seems fine to me...
>
> That said, I suppose a comment indicating that it is intentionally
> overflowing is probably merited. Ted, Andrew, is this acceptable to you?

Seems reasonable to me, thanks.

- Ted