2013-05-14 01:19:07

by Kent Overstreet

Subject: AIO refactoring/performance improvements/cancellation

This is a respin of the AIO patches that were deferred until 3.11, along
with some other stuff I had queued up.

Changes:

* Took the dynamic allocation stuff out of the percpu refcounting
patch, as Tejun wanted. I split the dynamic bits out into a separate
patch, which I may resend later.

* Changed batch completion to use a singly linked list instead of an rb
tree; it now calls batch_complete_aio() early if it has to look too
far down the list.

* Some batch completion performance improvements, to avoid doing nested
irqsave/restore (which was the source of a performance regression) and
to avoid freeing the kiocbs with irqs disabled.

* There's also some more assorted refactoring/minor performance
improvements that had been sitting in my tree for a while but weren't
in the patch series that was queued up for 3.10.

* And, the last few patches add cancellation for direct IO; these
patches are still preliminary but they do work and are useful for
some simple use cases.


2013-05-14 01:19:28

by Kent Overstreet

Subject: [PATCH 12/21] aio: convert the ioctx list to radix tree

From: Octavian Purdila <[email protected]>

When a large number of threads perform AIO operations, the ioctx list
can accumulate a large number of entries, which causes significant
lookup overhead. For example, when running this fio script:

rw=randrw; size=256k; directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=512; thread; loops=100

on an EXT2 filesystem mounted on top of a ramdisk, we can observe up to
30% of CPU time spent in lookup_ioctx:

    32.51%  [guest.kernel]  [g] lookup_ioctx
     9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
     4.40%  [guest.kernel]  [g] lock_release
     4.19%  [guest.kernel]  [g] sched_clock_local
     3.86%  [guest.kernel]  [g] local_clock
     3.68%  [guest.kernel]  [g] native_sched_clock
     3.08%  [guest.kernel]  [g] sched_clock_cpu
     2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
     2.60%  [guest.kernel]  [g] memcpy
     2.33%  [guest.kernel]  [g] lock_acquired
     2.25%  [guest.kernel]  [g] lock_acquire
     1.84%  [guest.kernel]  [g] do_io_submit

This patch converts the ioctx list to a radix tree. For a performance
comparison, the above fio script was run on a 2-socket, 8-core machine.
These are the results (average and %rsd of 10 runs) for the original
list-based implementation and for the radix-tree-based implementation:

cores                        1          2          4          8         16         32
list                 109376 ms   69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
%rsd                     0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
radix                 73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
%rsd                     1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
radix as % of list      66.12%     65.59%     66.63%     72.31%     77.26%     83.66%

To assess the impact of the patch on the typical case of having only
one ctx per process, the following fio script was run:

rw=randrw; size=100m; directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=1; thread; loops=100

on the same system and the results are the following:

list                  58892 ms
%rsd                     0.91%
radix                 59404 ms
%rsd                     0.81%
radix as % of list     100.87%
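
For clarity, the core of the change is that lookup_ioctx() goes from a
linear hlist walk to a single radix tree descent; a minimal sketch of
the new lookup (simplified from the diff below, with error handling
elided):

static struct kioctx *lookup_ioctx_sketch(struct mm_struct *mm,
					  unsigned long ctx_id)
{
	struct kioctx *ctx;

	rcu_read_lock();
	/* single descent keyed by the context id userspace passed in */
	ctx = radix_tree_lookup(&mm->ioctx_rtree, ctx_id);
	if (ctx)
		percpu_ref_get(&ctx->users);	/* take a ref before returning */
	rcu_read_unlock();

	return ctx;
}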

Signed-off-by: Octavian Purdila <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Josh Boyer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
arch/s390/mm/pgtable.c | 4 +--
fs/aio.c | 76 ++++++++++++++++++++++++++++++------------------
include/linux/mm_types.h | 3 +-
kernel/fork.c | 2 +-
4 files changed, 52 insertions(+), 33 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 7805ddc..500426d 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1029,7 +1029,7 @@ int s390_enable_sie(void)
task_lock(tsk);
if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
#ifdef CONFIG_AIO
- !hlist_empty(&tsk->mm->ioctx_list) ||
+ tsk->mm->ioctx_rtree.rnode ||
#endif
tsk->mm != tsk->active_mm) {
task_unlock(tsk);
@@ -1056,7 +1056,7 @@ int s390_enable_sie(void)
task_lock(tsk);
if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
#ifdef CONFIG_AIO
- !hlist_empty(&tsk->mm->ioctx_list) ||
+ tsk->mm->ioctx_rtree.rnode ||
#endif
tsk->mm != tsk->active_mm) {
mmput(mm);
diff --git a/fs/aio.c b/fs/aio.c
index 7ce3cd8..a127e5a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/percpu-refcount.h>
+#include <linux/radix-tree.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -68,9 +69,7 @@ struct kioctx_cpu {
struct kioctx {
struct percpu_ref users;

- /* This needs improving */
unsigned long user_id;
- struct hlist_node list;

struct __percpu kioctx_cpu *cpu;

@@ -437,10 +436,18 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
aio_nr += ctx->max_reqs;
spin_unlock(&aio_nr_lock);

- /* now link into global list. */
+ /* now insert into the radix tree */
+ err = radix_tree_preload(GFP_KERNEL);
+ if (err)
+ goto out_cleanup;
spin_lock(&mm->ioctx_lock);
- hlist_add_head_rcu(&ctx->list, &mm->ioctx_list);
+ err = radix_tree_insert(&mm->ioctx_rtree, ctx->user_id, ctx);
spin_unlock(&mm->ioctx_lock);
+ radix_tree_preload_end();
+ if (err) {
+ WARN_ONCE(1, "aio: insert into ioctx tree failed: %d", err);
+ goto out_cleanup;
+ }

pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
ctx, ctx->user_id, mm, ctx->nr_events);
@@ -483,8 +490,8 @@ static void kill_ioctx_rcu(struct rcu_head *head)
static void kill_ioctx(struct kioctx *ctx)
{
if (percpu_ref_kill(&ctx->users)) {
- hlist_del_rcu(&ctx->list);
- /* Between hlist_del_rcu() and dropping the initial ref */
+ radix_tree_delete(&current->mm->ioctx_rtree, ctx->user_id);
+ /* Between radix_tree_delete() and dropping the initial ref */
synchronize_rcu();

/*
@@ -524,25 +531,38 @@ EXPORT_SYMBOL(wait_on_sync_kiocb);
*/
void exit_aio(struct mm_struct *mm)
{
- struct kioctx *ctx;
- struct hlist_node *n;
-
- hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
- /*
- * We don't need to bother with munmap() here -
- * exit_mmap(mm) is coming and it'll unmap everything.
- * Since aio_free_ring() uses non-zero ->mmap_size
- * as indicator that it needs to unmap the area,
- * just set it to 0; aio_free_ring() is the only
- * place that uses ->mmap_size, so it's safe.
- */
- ctx->mmap_size = 0;
+ struct kioctx *ctx[16];
+ unsigned long idx = 0;
+ int count;

- if (percpu_ref_kill(&ctx->users)) {
- hlist_del_rcu(&ctx->list);
- call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
+ do {
+ int i;
+
+ count = radix_tree_gang_lookup(&mm->ioctx_rtree, (void **)ctx,
+ idx, sizeof(ctx)/sizeof(void *));
+ for (i = 0; i < count; i++) {
+ void *ret;
+
+ BUG_ON(ctx[i]->user_id < idx);
+ idx = ctx[i]->user_id;
+
+ /*
+ * We don't need to bother with munmap() here -
+ * exit_mmap(mm) is coming and it'll unmap everything.
+ * Since aio_free_ring() uses non-zero ->mmap_size
+ * as indicator that it needs to unmap the area,
+ * just set it to 0; aio_free_ring() is the only
+ * place that uses ->mmap_size, so it's safe.
+ */
+ ctx[i]->mmap_size = 0;
+
+ if (percpu_ref_kill(&ctx[i]->users)) {
+ ret = radix_tree_delete(&mm->ioctx_rtree, idx);
+ BUG_ON(!ret || ret != ctx[i]);
+ call_rcu(&ctx[i]->rcu_head, kill_ioctx_rcu);
+ }
}
- }
+ } while (count);
}

static void put_reqs_available(struct kioctx *ctx, unsigned nr)
@@ -629,12 +649,10 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

rcu_read_lock();

- hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
- if (ctx->user_id == ctx_id) {
- percpu_ref_get(&ctx->users);
- ret = ctx;
- break;
- }
+ ctx = radix_tree_lookup(&mm->ioctx_rtree, ctx_id);
+ if (ctx) {
+ percpu_ref_get(&ctx->users);
+ ret = ctx;
}

rcu_read_unlock();
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..758ad98 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
#include <linux/types.h>
#include <linux/threads.h>
#include <linux/list.h>
+#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
@@ -386,7 +387,7 @@ struct mm_struct {
struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
- struct hlist_head ioctx_list;
+ struct radix_tree_root ioctx_rtree;
#endif
#ifdef CONFIG_MM_OWNER
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 987b28a..05d232f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -524,7 +524,7 @@ static void mm_init_aio(struct mm_struct *mm)
{
#ifdef CONFIG_AIO
spin_lock_init(&mm->ioctx_lock);
- INIT_HLIST_HEAD(&mm->ioctx_list);
+ INIT_RADIX_TREE(&mm->ioctx_rtree, GFP_KERNEL);
#endif
}

--
1.8.2.1

2013-05-14 01:19:21

by Kent Overstreet

Subject: [PATCH 08/21] aio: Kill aio_rw_vect_retry()

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO and are called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
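
For reference, the synchronous path makes a single call into
->aio_read()/->aio_write() and then just waits; roughly (a simplified
sketch along the lines of do_sync_read(), not verbatim kernel code):

static ssize_t sync_read_sketch(struct file *filp, char __user *buf,
				size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = *ppos;
	kiocb.ki_nbytes = len;

	/* one shot - there is no retry loop in the sync case */
	ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
	if (ret == -EIOCBQUEUED)
		ret = wait_on_sync_kiocb(&kiocb);

	*ppos = kiocb.ki_pos;
	return ret;
}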

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/staging/android/logger.c | 2 +-
drivers/usb/gadget/inode.c | 6 +--
fs/aio.c | 91 ++++++++--------------------------------
fs/block_dev.c | 2 +-
fs/nfs/direct.c | 1 -
fs/ocfs2/file.c | 6 +--
fs/read_write.c | 3 --
fs/udf/file.c | 2 +-
include/linux/aio.h | 2 -
mm/page_io.c | 1 -
net/socket.c | 2 +-
11 files changed, 28 insertions(+), 90 deletions(-)

diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index b040200..f9dbebe 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -481,7 +481,7 @@ static ssize_t logger_aio_write(struct kiocb *iocb, const struct iovec *iov,
header.sec = now.tv_sec;
header.nsec = now.tv_nsec;
header.euid = current_euid();
- header.len = min_t(size_t, iocb->ki_left, LOGGER_ENTRY_MAX_PAYLOAD);
+ header.len = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD);
header.hdr_size = sizeof(struct logger_entry);

/* null writes succeed, return zero */
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index e02c1e0..f255ad7 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -708,11 +708,11 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;

- buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+ buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;

- return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
+ return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs);
}

static ssize_t
@@ -727,7 +727,7 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;

- buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+ buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;

diff --git a/fs/aio.c b/fs/aio.c
index 2c9a5ac..73ec062 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -621,7 +621,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;

- atomic_set(&req->ki_users, 2);
+ atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
return req;
out_put:
@@ -965,75 +965,9 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
}

-static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
-{
- struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg];
-
- BUG_ON(ret <= 0);
-
- while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) {
- ssize_t this = min((ssize_t)iov->iov_len, ret);
- iov->iov_base += this;
- iov->iov_len -= this;
- iocb->ki_left -= this;
- ret -= this;
- if (iov->iov_len == 0) {
- iocb->ki_cur_seg++;
- iov++;
- }
- }
-
- /* the caller should not have done more io than what fit in
- * the remaining iovecs */
- BUG_ON(ret > 0 && iocb->ki_left == 0);
-}
-
typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);

-static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
-{
- struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- ssize_t ret = 0;
-
- /* This matches the pread()/pwrite() logic */
- if (iocb->ki_pos < 0)
- return -EINVAL;
-
- if (rw == WRITE)
- file_start_write(file);
- do {
- ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
- iocb->ki_nr_segs - iocb->ki_cur_seg,
- iocb->ki_pos);
- if (ret > 0)
- aio_advance_iovec(iocb, ret);
-
- /* retry all partial writes. retry partial reads as long as its a
- * regular file. */
- } while (ret > 0 && iocb->ki_left > 0 &&
- (rw == WRITE ||
- (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))));
- if (rw == WRITE)
- file_end_write(file);
-
- /* This means we must have transferred all that we could */
- /* No need to retry anymore */
- if ((ret == 0) || (iocb->ki_left == 0))
- ret = iocb->ki_nbytes - iocb->ki_left;
-
- /* If we managed to write some out we return that, rather than
- * the eventual error. */
- if (rw == WRITE
- && ret < 0 && ret != -EIOCBQUEUED
- && iocb->ki_nbytes - iocb->ki_left)
- ret = iocb->ki_nbytes - iocb->ki_left;
-
- return ret;
-}
-
static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
{
ssize_t ret;
@@ -1118,9 +1052,22 @@ rw_common:
return ret;

req->ki_nbytes = ret;
- req->ki_left = ret;

- ret = aio_rw_vect_retry(req, rw, rw_op);
+ /* XXX: move/kill - rw_verify_area()? */
+ /* This matches the pread()/pwrite() logic */
+ if (req->ki_pos < 0) {
+ ret = -EINVAL;
+ break;
+ }
+
+ if (rw == WRITE)
+ file_start_write(file);
+
+ ret = rw_op(req, req->ki_iovec,
+ req->ki_nr_segs, req->ki_pos);
+
+ if (rw == WRITE)
+ file_end_write(file);
break;

case IOCB_CMD_FDSYNC:
@@ -1215,19 +1162,17 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_pos = iocb->aio_offset;

req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
- req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+ req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;

ret = aio_run_iocb(req, compat);
if (ret)
goto out_put_req;

- aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
put_reqs_available(ctx, 1);
- aio_put_req(req); /* drop extra ref to req */
- aio_put_req(req); /* drop i/o ref to req */
+ aio_put_req(req);
return ret;
}

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..2964b15 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,7 +1556,7 @@ static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
return 0;

size -= pos;
- if (size < iocb->ki_left)
+ if (size < iocb->ki_nbytes)
nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
return generic_file_aio_read(iocb, iov, nr_segs, pos);
}
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 0bd7a55..91ff089 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -130,7 +130,6 @@ ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_

return -EINVAL;
#else
- VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);

if (rw == READ || rw == KERNEL_READ)
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 8a7509f..c85ad15 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2245,7 +2245,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
file->f_path.dentry->d_name.name,
(unsigned int)nr_segs);

- if (iocb->ki_left == 0)
+ if (iocb->ki_nbytes == 0)
return 0;

appending = file->f_flags & O_APPEND ? 1 : 0;
@@ -2296,7 +2296,7 @@ relock:

can_do_direct = direct_io;
ret = ocfs2_prepare_inode_for_write(file, ppos,
- iocb->ki_left, appending,
+ iocb->ki_nbytes, appending,
&can_do_direct, &has_refcount);
if (ret < 0) {
mlog_errno(ret);
@@ -2304,7 +2304,7 @@ relock:
}

if (direct_io && !is_sync_kiocb(iocb))
- unaligned_dio = ocfs2_is_io_unaligned(inode, iocb->ki_left,
+ unaligned_dio = ocfs2_is_io_unaligned(inode, iocb->ki_nbytes,
*ppos);

/*
diff --git a/fs/read_write.c b/fs/read_write.c
index 0343000..421cee4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -338,7 +338,6 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
@@ -388,7 +387,6 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
@@ -568,7 +566,6 @@ static ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 29569dd..c02a27a 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -141,7 +141,7 @@ static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
int err, pos;
- size_t count = iocb->ki_left;
+ size_t count = iocb->ki_nbytes;
struct udf_inode_info *iinfo = UDF_I(inode);

down_write(&iinfo->i_data_sem);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 8c8dd1d..7bb766e 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -50,11 +50,9 @@ struct kiocb {
unsigned short ki_opcode;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */
char __user *ki_buf; /* remaining iocb->aio_buf */
- size_t ki_left; /* remaining bytes */
struct iovec ki_inline_vec; /* inline vector */
struct iovec *ki_iovec;
unsigned long ki_nr_segs;
- unsigned long ki_cur_seg;

struct list_head ki_list; /* the aio core uses this
* for cancellation */
diff --git a/mm/page_io.c b/mm/page_io.c
index a8a3ef4..3db0f5f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -220,7 +220,6 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
- kiocb.ki_left = PAGE_SIZE;
kiocb.ki_nbytes = PAGE_SIZE;

set_page_writeback(page);
diff --git a/net/socket.c b/net/socket.c
index 6b94633..bfe9fab 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -925,7 +925,7 @@ static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (pos != 0)
return -ESPIPE;

- if (iocb->ki_left == 0) /* Match SYS5 behaviour */
+ if (iocb->ki_nbytes == 0) /* Match SYS5 behaviour */
return 0;


--
1.8.2.1

2013-05-14 01:19:38

by Kent Overstreet

Subject: [PATCH 20/21] direct-io: Set dio->io_error directly

The way IO errors are returned in the dio code was rather convoluted,
and it also meant that the specific error code was lost. We need to
return the actual error so that, for cancellation, we can pass up
-ECANCELED.
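
Schematically, the change is just to record the real error where the
bio completes, instead of collapsing everything to -EIO later (a
simplified sketch, not the exact dio code):

/* bio end_io callbacks now capture the specific error code directly */
static void dio_end_io_sketch(struct bio *bio, int error)
{
	struct dio *dio = bio->bi_private;

	if (error)		/* e.g. -EIO, or -ECANCELED on cancellation */
		dio->io_error = error;

	/* ... hand the bio to the rest of dio completion as before ... */
}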

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/direct-io.c | 38 +++++++++++++++++---------------------
1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b4dd97c..9ac3011 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -271,7 +271,7 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
return ret;
}

-static int dio_bio_complete(struct dio *dio, struct bio *bio);
+static void dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
*/
@@ -282,6 +282,9 @@ static void dio_bio_end_aio(struct bio *bio, int error,
unsigned long remaining;
unsigned long flags;

+ if (error)
+ dio->io_error = error;
+
/* cleanup the bio */
dio_bio_complete(dio, bio);

@@ -309,6 +312,9 @@ static void dio_bio_end_io(struct bio *bio, int error)
struct dio *dio = bio->bi_private;
unsigned long flags;

+ if (error)
+ dio->io_error = error;
+
spin_lock_irqsave(&dio->bio_lock, flags);
bio->bi_private = dio->bio_list;
dio->bio_list = bio;
@@ -438,15 +444,11 @@ static struct bio *dio_await_one(struct dio *dio)
/*
* Process one completed BIO. No locks are held.
*/
-static int dio_bio_complete(struct dio *dio, struct bio *bio)
+static void dio_bio_complete(struct dio *dio, struct bio *bio)
{
- const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec;
unsigned i;

- if (!uptodate)
- dio->io_error = -EIO;
-
if (dio->is_async && dio->rw == READ) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
@@ -459,7 +461,6 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
}
bio_put(bio);
}
- return uptodate ? 0 : -EIO;
}

/*
@@ -486,27 +487,21 @@ static void dio_await_completion(struct dio *dio)
*
* This also helps to limit the peak amount of pinned userspace memory.
*/
-static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
+static inline void dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
{
- int ret = 0;
-
if (sdio->reap_counter++ >= 64) {
while (dio->bio_list) {
unsigned long flags;
struct bio *bio;
- int ret2;

spin_lock_irqsave(&dio->bio_lock, flags);
bio = dio->bio_list;
dio->bio_list = bio->bi_private;
spin_unlock_irqrestore(&dio->bio_lock, flags);
- ret2 = dio_bio_complete(dio, bio);
- if (ret == 0)
- ret = ret2;
+ dio_bio_complete(dio, bio);
}
sdio->reap_counter = 0;
}
- return ret;
}

/*
@@ -591,19 +586,20 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
sector_t start_sector, struct buffer_head *map_bh)
{
sector_t sector;
- int ret, nr_pages;
+ int nr_pages;
+
+ dio_bio_reap(dio, sdio);
+
+ if (dio->io_error)
+ return dio->io_error;

- ret = dio_bio_reap(dio, sdio);
- if (ret)
- goto out;
sector = start_sector << (sdio->blkbits - 9);
nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
nr_pages = min(nr_pages, BIO_MAX_PAGES);
BUG_ON(nr_pages <= 0);
dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
sdio->boundary = 0;
-out:
- return ret;
+ return 0;
}

/*
--
1.8.2.1

2013-05-14 01:19:43

by Kent Overstreet

Subject: [PATCH 18/21] aio: Allow cancellation without a cancel callback, new kiocb lookup

This patch does a couple things:

* Allows cancellation of any kiocb, even if the driver doesn't
implement a ki_cancel callback function. This will be used for block
layer cancellation - there, implementing a callback is problematic,
but we can implement useful cancellation by just checking whether the
kiocb has been marked as cancelled when we go to dequeue the
request.

* Implements a new lookup mechanism for cancellation.

Previously, to cancel a kiocb we had to look it up in a linked list,
and kiocbs were added to the linked list lazily. But if any kiocb is
cancellable, the lazy list adding no longer works, so we need a new
mechanism.

This is done by allocating kiocbs out of a (lazily allocated) array
of pages, which means we can refer to the kiocbs (and iterate over
them) with small integers - we use the percpu tag allocation code for
allocating individual kiocbs. (A simplified sketch of the id-to-kiocb
mapping follows this list.)
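
A simplified sketch of the id-to-kiocb mapping this sets up (the
constant and helper mirror the fs/aio.c diff below):

#define KIOCBS_PER_PAGE	(PAGE_SIZE / sizeof(struct kiocb))

/*
 * kiocb ids are small integers handed out by the percpu tag allocator;
 * each id indexes into a lazily allocated array of pages of kiocbs.
 */
static inline struct kiocb *kiocb_from_id(struct kioctx *ctx, unsigned id)
{
	struct page *p = ctx->kiocb_pages[id / KIOCBS_PER_PAGE];

	/* NULL means the page backing this id hasn't been allocated yet */
	return p
		? ((struct kiocb *) page_address(p)) + (id % KIOCBS_PER_PAGE)
		: NULL;
}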

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/aio.c | 207 +++++++++++++++++++++++++++++++++-------------------
include/linux/aio.h | 92 ++++++++++++++++-------
2 files changed, 197 insertions(+), 102 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index aa39194..f4ea8d5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -39,6 +39,7 @@
#include <linux/compat.h>
#include <linux/percpu-refcount.h>
#include <linux/radix-tree.h>
+#include <linux/tags.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -74,6 +75,9 @@ struct kioctx {

struct __percpu kioctx_cpu *cpu;

+ struct tag_pool kiocb_tags;
+ struct page **kiocb_pages;
+
/*
* For percpu reqs_available, number of slots we move to/from global
* counter at a time:
@@ -113,11 +117,6 @@ struct kioctx {
} ____cacheline_aligned_in_smp;

struct {
- spinlock_t ctx_lock;
- struct list_head active_reqs; /* used for cancellation */
- } ____cacheline_aligned_in_smp;
-
- struct {
struct mutex ring_lock;
wait_queue_head_t wait;
} ____cacheline_aligned_in_smp;
@@ -136,16 +135,25 @@ unsigned long aio_nr; /* current system wide number of aio requests */
unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
/*----end sysctl variables---*/

-static struct kmem_cache *kiocb_cachep;
static struct kmem_cache *kioctx_cachep;

+#define KIOCBS_PER_PAGE (PAGE_SIZE / sizeof(struct kiocb))
+
+static inline struct kiocb *kiocb_from_id(struct kioctx *ctx, unsigned id)
+{
+ struct page *p = ctx->kiocb_pages[id / KIOCBS_PER_PAGE];
+
+ return p
+ ? ((struct kiocb *) page_address(p)) + (id % KIOCBS_PER_PAGE)
+ : NULL;
+}
+
/* aio_setup
* Creates the slab caches used by the aio routines, panic on
* failure as this is done early during the boot sequence.
*/
static int __init aio_setup(void)
{
- kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);

pr_debug("sizeof(struct page) = %zu\n", sizeof(struct page));
@@ -245,45 +253,58 @@ static int aio_setup_ring(struct kioctx *ctx)

void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
{
- struct kioctx *ctx = req->ki_ctx;
- unsigned long flags;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
+ kiocb_cancel_fn *p, *old = req->ki_cancel;

- if (!req->ki_list.next)
- list_add(&req->ki_list, &ctx->active_reqs);
-
- req->ki_cancel = cancel;
+ do {
+ if (old == KIOCB_CANCELLED) {
+ cancel(req);
+ return;
+ }

- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ p = old;
+ old = cmpxchg(&req->ki_cancel, old, cancel);
+ } while (old != p);
}
EXPORT_SYMBOL(kiocb_set_cancel_fn);

-static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
+static void kiocb_cancel(struct kioctx *ctx, struct kiocb *req)
{
- kiocb_cancel_fn *old, *cancel;
+ kiocb_cancel_fn *old, *new, *cancel = req->ki_cancel;

- /*
- * Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
- * actually has a cancel function, hence the cmpxchg()
- */
+ local_irq_disable();

- cancel = ACCESS_ONCE(kiocb->ki_cancel);
do {
- if (!cancel || cancel == KIOCB_CANCELLED)
- return -EINVAL;
+ if (cancel == KIOCB_CANCELLING ||
+ cancel == KIOCB_CANCELLED)
+ goto out;

old = cancel;
- cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
- } while (cancel != old);
+ new = cancel ? KIOCB_CANCELLING : KIOCB_CANCELLED;
+
+ cancel = cmpxchg(&req->ki_cancel, old, KIOCB_CANCELLING);
+ } while (old != cancel);

- return cancel(kiocb);
+ if (cancel) {
+ cancel(req);
+ smp_wmb();
+ req->ki_cancel = KIOCB_CANCELLED;
+ }
+out:
+ local_irq_enable();
}

static void free_ioctx_rcu(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+ unsigned i;
+
+ for (i = 0; i < DIV_ROUND_UP(ctx->nr_events, KIOCBS_PER_PAGE); i++)
+ if (ctx->kiocb_pages[i])
+ __free_page(ctx->kiocb_pages[i]);

+ kfree(ctx->kiocb_pages);
+
+ tag_pool_free(&ctx->kiocb_tags);
free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
}
@@ -296,21 +317,16 @@ static void free_ioctx_rcu(struct rcu_head *head)
static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
- struct kiocb *req;
- unsigned cpu, avail;
+ unsigned i, cpu, avail;
DEFINE_WAIT(wait);

- spin_lock_irq(&ctx->ctx_lock);
+ for (i = 0; i < ctx->nr_events; i++) {
+ struct kiocb *req = kiocb_from_id(ctx, i);

- while (!list_empty(&ctx->active_reqs)) {
- req = list_first_entry(&ctx->active_reqs,
- struct kiocb, ki_list);
-
- list_del_init(&req->ki_list);
- kiocb_cancel(ctx, req);
+ if (req)
+ kiocb_cancel(ctx, req);
}

- spin_unlock_irq(&ctx->ctx_lock);

for_each_possible_cpu(cpu) {
struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
@@ -409,13 +425,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
percpu_ref_get(&ctx->users);
rcu_read_unlock();

- spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
init_waitqueue_head(&ctx->wait);

- INIT_LIST_HEAD(&ctx->active_reqs);
-
ctx->cpu = alloc_percpu(struct kioctx_cpu);
if (!ctx->cpu)
goto out_freeref;
@@ -427,6 +440,15 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
BUG_ON(!ctx->req_batch);

+ if (tag_pool_init(&ctx->kiocb_tags, ctx->nr_events))
+ goto out_freering;
+
+ ctx->kiocb_pages =
+ kzalloc(DIV_ROUND_UP(ctx->nr_events, KIOCBS_PER_PAGE) *
+ sizeof(struct page *), GFP_KERNEL);
+ if (!ctx->kiocb_pages)
+ goto out_freetags;
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -456,6 +478,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

out_cleanup:
err = -EAGAIN;
+ kfree(ctx->kiocb_pages);
+out_freetags:
+ tag_pool_free(&ctx->kiocb_tags);
+out_freering:
aio_free_ring(ctx);
out_freepcpu:
free_percpu(ctx->cpu);
@@ -619,17 +645,46 @@ out:
static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;
+ unsigned id;

if (!get_reqs_available(ctx))
return NULL;

- req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
- if (unlikely(!req))
- goto out_put;
+ id = tag_alloc(&ctx->kiocb_tags, false);
+ if (!id)
+ goto err;
+
+ req = kiocb_from_id(ctx, id);
+ if (!req) {
+ unsigned i, page_nr = id / KIOCBS_PER_PAGE;
+ struct page *p = alloc_page(GFP_KERNEL);
+ if (!p)
+ goto err;

+ req = page_address(p);
+
+ for (i = 0; i < KIOCBS_PER_PAGE; i++) {
+ req[i].ki_cancel = KIOCB_CANCELLED;
+ req[i].ki_id = page_nr * KIOCBS_PER_PAGE + i;
+ }
+
+ smp_wmb();
+
+ if (cmpxchg(&ctx->kiocb_pages[page_nr], NULL, p) != NULL)
+ __free_page(p);
+ }
+
+ req = kiocb_from_id(ctx, id);
+
+ /*
+ * Can't set ki_cancel to NULL until we're ready for it to be
+ * cancellable - leave it as KIOCB_CANCELLED until then
+ */
+ memset(req, 0, offsetof(struct kiocb, ki_cancel));
req->ki_ctx = ctx;
+
return req;
-out_put:
+err:
put_reqs_available(ctx, 1);
return NULL;
}
@@ -640,7 +695,7 @@ static void kiocb_free(struct kiocb *req)
fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
- kmem_cache_free(kiocb_cachep, req);
+ tag_free(&req->ki_ctx->kiocb_tags, req->ki_id);
}

static struct kioctx *lookup_ioctx(unsigned long ctx_id)
@@ -770,17 +825,21 @@ EXPORT_SYMBOL(batch_complete_aio);
void aio_complete_batch(struct kiocb *req, long res, long res2,
struct batch_complete *batch)
{
- req->ki_res = res;
- req->ki_res2 = res2;
+ kiocb_cancel_fn *old = NULL, *cancel = req->ki_cancel;
+
+ do {
+ if (cancel == KIOCB_CANCELLING) {
+ cpu_relax();
+ cancel = req->ki_cancel;
+ continue;
+ }

- if (req->ki_list.next) {
- struct kioctx *ctx = req->ki_ctx;
- unsigned long flags;
+ old = cancel;
+ cancel = cmpxchg(&req->ki_cancel, old, KIOCB_CANCELLED);
+ } while (old != cancel);

- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_del(&req->ki_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- }
+ req->ki_res = res;
+ req->ki_res2 = res2;

/*
* Special case handling for sync iocbs:
@@ -1204,7 +1263,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
}
}

- ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
+ ret = put_user(req->ki_id, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
goto out_put_req;
@@ -1215,6 +1274,13 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_pos = iocb->aio_offset;
req->ki_nbytes = iocb->aio_nbytes;

+ /*
+ * ki_obj.user must point to the right iocb before making the kiocb
+ * cancellable by setting ki_cancel = NULL:
+ */
+ smp_wmb();
+ req->ki_cancel = NULL;
+
ret = aio_run_iocb(req, iocb->aio_lio_opcode,
(char __user *)(unsigned long)iocb->aio_buf,
compat);
@@ -1305,19 +1371,16 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,
u32 key)
{
- struct list_head *pos;
-
- assert_spin_locked(&ctx->ctx_lock);
+ struct kiocb *req;

- if (key != KIOCB_KEY)
+ if (key > ctx->nr_events)
return NULL;

- /* TODO: use a hash or array, this sucks. */
- list_for_each(pos, &ctx->active_reqs) {
- struct kiocb *kiocb = list_kiocb(pos);
- if (kiocb->ki_obj.user == iocb)
- return kiocb;
- }
+ req = kiocb_from_id(ctx, key);
+
+ if (req && req->ki_obj.user == iocb)
+ return req;
+
return NULL;
}

@@ -1347,17 +1410,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
if (unlikely(!ctx))
return -EINVAL;

- spin_lock_irq(&ctx->ctx_lock);
-
kiocb = lookup_kiocb(ctx, iocb, key);
- if (kiocb)
- ret = kiocb_cancel(ctx, kiocb);
- else
- ret = -EINVAL;
-
- spin_unlock_irq(&ctx->ctx_lock);
-
- if (!ret) {
+ if (kiocb) {
+ kiocb_cancel(ctx, kiocb);
/*
* The result argument is no longer used - the io_event is
* always delivered via the ring buffer. -EINPROGRESS indicates
diff --git a/include/linux/aio.h b/include/linux/aio.h
index a6fe048..985e664 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -13,31 +13,80 @@ struct kioctx;
struct kiocb;
struct batch_complete;

-#define KIOCB_KEY 0
-
/*
- * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
- * cancelled or completed (this makes a certain amount of sense because
- * successful cancellation - io_cancel() - does deliver the completion to
- * userspace).
+ * CANCELLATION
+ *
+ * SEMANTICS:
+ *
+ * Userspace may indicate (via io_cancel()) that they wish an iocb to be
+ * cancelled. io_cancel() does nothing more than indicate that the iocb should
+ * be cancelled if possible; it does not indicate whether it succeeded (nor will
+ * it block).
+ *
+ * If cancellation does succeed, userspace should be informed by passing
+ * -ECANCELLED to aio_complete(); userspace retrieves the io_event in the usual
+ * manner.
+ *
+ * DRIVERS:
+ *
+ * A driver that wishes to support cancellation may (but does not have to)
+ * implement a ki_cancel callback. If it doesn't implement a callback, it can
+ * check if the kiocb has been marked as cancelled (with kiocb_cancelled()).
+ * This is what the block layer does - when dequeuing requests it checks to see
+ * if it's for a bio that's been marked as cancelled, and if so doesn't send it
+ * to the device.
+ *
+ * Some drivers are going to need to kick something to notice that kiocb has
+ * been cancelled - those will want to implement a ki_cancel function. The
+ * callback could, say, issue a wakeup so that the thread processing the kiocb
+ * can notice the cancellation - or it might do something else entirely.
+ * kiocb->private is owned by the driver, so that ki_cancel can find the
+ * driver's state.
+ *
+ * A driver must guarantee that a kiocb completes in bounded time if it's been
+ * cancelled - this means that ki_cancel may have to guarantee forward progress.
+ *
+ * ki_cancel() may not call aio_complete().
*
- * And since most things don't implement kiocb cancellation and we'd really like
- * kiocb completion to be lockless when possible, we use ki_cancel to
- * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
- * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
+ * SYNCHRONIZATION:
+ *
+ * The aio code ensures that after aio_complete() returns, no ki_cancel function
+ * can be called or still be executing. Thus, the driver should free whatever
+ * kiocb->private points to after calling aio_complete().
+ *
+ * Drivers must not set kiocb->ki_cancel directly; they should use
+ * kiocb_set_cancel_fn(), which guards against races with kiocb_cancel(). It
+ * might be the case that userspace cancelled the iocb before the driver called
+ * kiocb_set_cancel_fn() - in that case, kiocb_set_cancel_fn() will immediately
+ * call the cancel function you passed it, and leave ki_cancel set to
+ * KIOCB_CANCELLED.
+ */
+
+/*
+ * Special values for kiocb->ki_cancel - these indicate that a kiocb has either
+ * been cancelled, or has a ki_cancel function currently running.
*/
-#define KIOCB_CANCELLED ((void *) (~0ULL))
+#define KIOCB_CANCELLED ((void *) (-1LL))
+#define KIOCB_CANCELLING ((void *) (-2LL))

typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
struct kiocb *ki_next; /* batch completion */

+ /*
+ * If the aio_resfd field of the userspace iocb is not zero,
+ * this is the underlying eventfd context to deliver events to.
+ */
+ struct eventfd_ctx *ki_eventfd;
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
- kiocb_cancel_fn *ki_cancel;
void *private;

+ /* Only zero up to here in aio_get_req() */
+ kiocb_cancel_fn *ki_cancel;
+ unsigned ki_id;
+
union {
void __user *user;
struct task_struct *tsk;
@@ -49,17 +98,13 @@ struct kiocb {

loff_t ki_pos;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */
-
- struct list_head ki_list; /* the aio core uses this
- * for cancellation */
-
- /*
- * If the aio_resfd field of the userspace iocb is not zero,
- * this is the underlying eventfd context to deliver events to.
- */
- struct eventfd_ctx *ki_eventfd;
};

+static inline bool kiocb_cancelled(struct kiocb *kiocb)
+{
+ return kiocb->ki_cancel == KIOCB_CANCELLED;
+}
+
static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
return kiocb->ki_ctx == NULL;
@@ -107,11 +152,6 @@ static inline void aio_complete(struct kiocb *iocb, long res, long res2)
aio_complete_batch(iocb, res, res2, NULL);
}

-static inline struct kiocb *list_kiocb(struct list_head *h)
-{
- return list_entry(h, struct kiocb, ki_list);
-}
-
/* for sysctl: */
extern unsigned long aio_nr;
extern unsigned long aio_max_nr;
--
1.8.2.1

2013-05-14 01:19:40

by Kent Overstreet

Subject: [PATCH 19/21] aio/usb: Update cancellation for new synchronization

The previous patch got rid of kiocb->ki_users; this was done by having
kiocb_cancel()/aio_complete() explicitly synchronize with each other.

The new rule is that after a driver's call to aio_complete() returns,
ki_cancel cannot still be running and it's safe to dispose of whatever
kiocb->private points to. But this means ki_cancel() won't be able to
call aio_complete() itself, or aio_complete() will deadlock.

So, update the driver accordingly.
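
A minimal sketch of the ordering this rule imposes on a driver's
completion path (illustrative only - the struct and function names here
are hypothetical; the real conversion is the inode.c diff below):

static void driver_complete_sketch(struct kiocb *iocb, long res)
{
	/* hypothetical per-request driver state hung off iocb->private */
	struct my_req *priv = iocb->private;

	/*
	 * After aio_complete() returns, no ki_cancel callback can still
	 * be running, so per-request state may be torn down afterwards.
	 * ki_cancel itself must not call aio_complete(), or it would
	 * deadlock.
	 */
	aio_complete(iocb, res, 0);
	kfree(priv);
}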

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/usb/gadget/inode.c | 61 +++++++++++++++++++++-------------------------
1 file changed, 28 insertions(+), 33 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index f255ad7..69adb87 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -522,6 +522,7 @@ struct kiocb_priv {
const struct iovec *iv;
unsigned long nr_segs;
unsigned actual;
+ int status;
};

static int ep_aio_cancel(struct kiocb *iocb)
@@ -577,14 +578,26 @@ static void ep_user_copy_worker(struct work_struct *work)
struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
struct mm_struct *mm = priv->mm;
struct kiocb *iocb = priv->iocb;
- size_t ret;

- use_mm(mm);
- ret = ep_copy_to_user(priv);
- unuse_mm(mm);
+ if (priv->iv && priv->actual) {
+ size_t ret;
+
+ use_mm(mm);
+ ret = ep_copy_to_user(priv);
+ unuse_mm(mm);
+
+ if (!priv->status)
+ priv->status = ret;
+ /*
+ * completing the iocb can drop the ctx and mm, don't touch mm
+ * after
+ */
+ }

- /* completing the iocb can drop the ctx and mm, don't touch mm after */
- aio_complete(iocb, ret, ret);
+
+ /* aio_complete() reports bytes-transferred _and_ faults */
+ aio_complete(iocb, priv->actual ? priv->actual : priv->status,
+ priv->status);

kfree(priv->buf);
kfree(priv);
@@ -596,36 +609,18 @@ static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
struct kiocb_priv *priv = iocb->private;
struct ep_data *epdata = priv->epdata;

- /* lock against disconnect (and ideally, cancel) */
- spin_lock(&epdata->dev->lock);
- priv->req = NULL;
- priv->epdata = NULL;
-
- /* if this was a write or a read returning no data then we
- * don't need to copy anything to userspace, so we can
- * complete the aio request immediately.
- */
- if (priv->iv == NULL || unlikely(req->actual == 0)) {
- kfree(req->buf);
- kfree(priv);
- iocb->private = NULL;
- /* aio_complete() reports bytes-transferred _and_ faults */
- aio_complete(iocb, req->actual ? req->actual : req->status,
- req->status);
- } else {
- /* ep_copy_to_user() won't report both; we hide some faults */
- if (unlikely(0 != req->status))
- DBG(epdata->dev, "%s fault %d len %d\n",
- ep->name, req->status, req->actual);
-
- priv->buf = req->buf;
- priv->actual = req->actual;
- schedule_work(&priv->work);
- }
- spin_unlock(&epdata->dev->lock);
+ priv->buf = req->buf;
+ priv->actual = req->actual;
+ priv->status = req->status;

usb_ep_free_request(ep, req);
put_ep(epdata);
+
+ if ((priv->iv && priv->actual) ||
+ iocb->ki_cancel == KIOCB_CANCELLING)
+ schedule_work(&priv->work);
+ else
+ ep_user_copy_worker(&priv->work);
}

static ssize_t
--
1.8.2.1

2013-05-14 01:20:52

by Kent Overstreet

Subject: [PATCH 21/21] block: Bio cancellation

If a bio is associated with a kiocb, allow it to be cancelled.

This is accomplished by adding a pointer to a kiocb in struct bio, and
when we go to dequeue a request we check if its bio has been cancelled -
if so, we end the request with -ECANCELED.

We don't currently try to cancel bios if IO has already been started -
that'd require a per bio callback function, and a way to find all the
outstanding bios for a given kiocb. Such a mechanism may or may not be
added in the future but this patch tries to start simple.

Currently this can only be triggered with aio and io_cancel(), but the
mechanism can be used for sync io too.

It can also be used for bios created by stacking drivers, and bio
clones in general - when cloning a bio, if the bi_iocb pointer is
copied as well, the clone will then be cancellable. bio_clone() could
be modified to do this, but hasn't been in this patch, because all the
bio_clone() users would need to be audited to make sure that it's safe.
We can't blindly make e.g. raid5 writes cancellable without the md
code's knowledge.
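
For illustration only, a hypothetical clone helper that opts a stacked
bio into cancellation might look like the sketch below; it is not part
of this patch, and whether copying bi_iocb is safe is exactly the
per-driver audit question raised above:

/*
 * Hypothetical: only for drivers audited to cope with the clone
 * completing early with -ECANCELED.
 */
static struct bio *bio_clone_cancellable(struct bio *bio, gfp_t gfp)
{
	struct bio *clone = bio_clone(bio, gfp);

	if (clone)
		clone->bi_iocb = bio->bi_iocb;	/* propagate the kiocb */
	return clone;
}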

Initial patch by Anatol Pomazau ([email protected]).

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
block/blk-core.c | 15 +++++++++++++++
fs/direct-io.c | 1 +
include/linux/aio.h | 6 ++++++
include/linux/blk_types.h | 1 +
4 files changed, 23 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 94aa4e7..6bb99b6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -31,6 +31,7 @@
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
+#include <linux/aio.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>
@@ -1744,6 +1745,11 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

+ if (bio_cancelled(bio)) {
+ err = -ECANCELED;
+ goto end_io;
+ }
+
/*
* Various block parts want %current->io_context and lazy ioc
* allocation ends up trading a lot of pain for a small amount of
@@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
trace_block_rq_issue(q, rq);
}

+ if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
+ blk_start_request(rq);
+ __blk_end_request_all(rq, -ECANCELED);
+ continue;
+ }
+
if (!q->boundary_rq || q->boundary_rq == rq) {
q->end_sector = rq_end_sector(rq);
q->boundary_rq = NULL;
@@ -2308,6 +2320,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
char *error_type;

switch (error) {
+ case -ECANCELED:
+ goto noerr;
case -ENOLINK:
error_type = "recoverable transport";
break;
@@ -2328,6 +2342,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
(unsigned long long)blk_rq_pos(req));

}
+noerr:

blk_account_io_completion(req, nr_bytes);

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9ac3011..3ae5121 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -377,6 +377,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
unsigned long flags;

bio->bi_private = dio;
+ bio->bi_iocb = dio->iocb;

spin_lock_irqsave(&dio->bio_lock, flags);
dio->refcount++;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 985e664..4893b8b 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -8,6 +8,7 @@
#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/batch_complete.h>
+#include <linux/blk_types.h>

struct kioctx;
struct kiocb;
@@ -105,6 +106,11 @@ static inline bool kiocb_cancelled(struct kiocb *kiocb)
return kiocb->ki_cancel == KIOCB_CANCELLED;
}

+static inline bool bio_cancelled(struct bio *bio)
+{
+ return bio->bi_iocb && kiocb_cancelled(bio->bi_iocb);
+}
+
static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
return kiocb->ki_ctx == NULL;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9d3cafa..7252484 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -43,6 +43,7 @@ struct bio {
* top bits priority
*/

+ struct kiocb *bi_iocb;
short bi_error;
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */
--
1.8.2.1

2013-05-14 01:20:54

by Kent Overstreet

Subject: [PATCH 17/21] Percpu tag allocator

Allocates integers out of a predefined range - for use by e.g. a driver
to allocate tags for communicating with the device.
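
A minimal usage sketch of the API this adds (the function names and
semantics are as in the patch below; the pool size, error codes and
caller are illustrative):

#include <linux/tags.h>

static struct tag_pool pool;

static int setup_tags(void)
{
	/* tag 0 is reserved as the "no tag" value, so usable tags are 1..127 */
	return tag_pool_init(&pool, 128);
}

static int do_one_request(void)
{
	unsigned tag = tag_alloc(&pool, false);	/* false: don't sleep */

	if (!tag)			/* 0 means nothing free right now */
		return -EBUSY;

	/* ... index per-request state by 'tag', talk to the device ... */

	tag_free(&pool, tag);
	return 0;
}

static void teardown_tags(void)
{
	tag_pool_free(&pool);
}

With wait == true, tag_alloc() instead sleeps until another CPU frees a
tag, so it never returns 0 in that mode.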

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/tags.h | 38 ++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +-
lib/tags.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 209 insertions(+), 1 deletion(-)
create mode 100644 include/linux/tags.h
create mode 100644 lib/tags.c

diff --git a/include/linux/tags.h b/include/linux/tags.h
new file mode 100644
index 0000000..1b8cfca
--- /dev/null
+++ b/include/linux/tags.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#ifndef _LINUX_TAGS_H
+#define _LINUX_TAGS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct tag_cpu_freelist;
+
+struct tag_pool {
+ unsigned watermark;
+ unsigned nr_tags;
+
+ struct tag_cpu_freelist *tag_cpu;
+
+ struct {
+ /* Global freelist */
+ unsigned nr_free;
+ unsigned *free;
+ spinlock_t lock;
+ struct list_head wait;
+ } ____cacheline_aligned;
+};
+
+unsigned tag_alloc(struct tag_pool *pool, bool wait);
+void tag_free(struct tag_pool *pool, unsigned tag);
+
+void tag_pool_free(struct tag_pool *pool);
+int tag_pool_init(struct tag_pool *pool, unsigned long nr_tags);
+
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index fe01d41..fa77e31 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -407,4 +407,7 @@ config OID_REGISTRY
config UCS2_STRING
tristate

+config PERCPU_TAG
+ bool
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 25a0ce1..c622107 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o percpu-refcount.o
+ earlycpio.o percpu-refcount.o tags.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/tags.c b/lib/tags.c
new file mode 100644
index 0000000..5c3de28
--- /dev/null
+++ b/lib/tags.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/tags.h>
+
+struct tag_cpu_freelist {
+ unsigned nr_free;
+ unsigned free[];
+};
+
+struct tag_waiter {
+ struct list_head list;
+ struct task_struct *task;
+};
+
+static inline void move_tags(unsigned *dst, unsigned *dst_nr,
+ unsigned *src, unsigned *src_nr,
+ unsigned nr)
+{
+ *src_nr -= nr;
+ memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
+ *dst_nr += nr;
+}
+
+unsigned tag_alloc(struct tag_pool *pool, bool wait)
+{
+ struct tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned ret;
+retry:
+ preempt_disable();
+ local_irq_save(flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+
+ while (!tags->nr_free) {
+ spin_lock(&pool->lock);
+
+ if (pool->nr_free)
+ move_tags(tags->free, &tags->nr_free,
+ pool->free, &pool->nr_free,
+ min(pool->nr_free, pool->watermark));
+ else if (wait) {
+ struct tag_waiter wait = { .task = current };
+
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ list_add(&wait.list, &pool->wait);
+
+ spin_unlock(&pool->lock);
+ local_irq_restore(flags);
+ preempt_enable();
+
+ schedule();
+ __set_current_state(TASK_RUNNING);
+
+ if (!list_empty_careful(&wait.list)) {
+ spin_lock_irqsave(&pool->lock, flags);
+ list_del_init(&wait.list);
+ spin_unlock_irqrestore(&pool->lock, flags);
+ }
+
+ goto retry;
+ } else
+ goto fail;
+
+ spin_unlock(&pool->lock);
+ }
+
+ ret = tags->free[--tags->nr_free];
+
+ local_irq_restore(flags);
+ preempt_enable();
+
+ return ret;
+fail:
+ local_irq_restore(flags);
+ preempt_enable();
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tag_alloc);
+
+void tag_free(struct tag_pool *pool, unsigned tag)
+{
+ struct tag_cpu_freelist *tags;
+ unsigned long flags;
+
+ preempt_disable();
+ local_irq_save(flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+
+ tags->free[tags->nr_free++] = tag;
+
+ if (tags->nr_free == pool->watermark * 2) {
+ spin_lock(&pool->lock);
+
+ move_tags(pool->free, &pool->nr_free,
+ tags->free, &tags->nr_free,
+ pool->watermark);
+
+ while (!list_empty(&pool->wait)) {
+ struct tag_waiter *wait;
+ wait = list_first_entry(&pool->wait,
+ struct tag_waiter, list);
+ list_del_init(&wait->list);
+ wake_up_process(wait->task);
+ }
+
+ spin_unlock(&pool->lock);
+ }
+
+ local_irq_restore(flags);
+ preempt_enable();
+}
+EXPORT_SYMBOL_GPL(tag_free);
+
+void tag_pool_free(struct tag_pool *pool)
+{
+ free_percpu(pool->tag_cpu);
+
+ free_pages((unsigned long) pool->free,
+ get_order(pool->nr_tags * sizeof(unsigned)));
+}
+EXPORT_SYMBOL_GPL(tag_pool_free);
+
+int tag_pool_init(struct tag_pool *pool, unsigned long nr_tags)
+{
+ unsigned i, order;
+
+ spin_lock_init(&pool->lock);
+ INIT_LIST_HEAD(&pool->wait);
+ pool->nr_tags = nr_tags;
+
+ /* Guard against overflow */
+ if (nr_tags > UINT_MAX)
+ return -ENOMEM;
+
+ order = get_order(nr_tags * sizeof(unsigned));
+ pool->free = (void *) __get_free_pages(GFP_KERNEL, order);
+ if (!pool->free)
+ return -ENOMEM;
+
+ for (i = 1; i < nr_tags; i++)
+ pool->free[pool->nr_free++] = i;
+
+ /* nr_possible_cpus would be more correct */
+ pool->watermark = nr_tags / (num_possible_cpus() * 4);
+
+ pool->watermark = min(pool->watermark, 128);
+
+ if (pool->watermark > 64)
+ pool->watermark = round_down(pool->watermark, 32);
+
+ pool->tag_cpu = __alloc_percpu(sizeof(struct tag_cpu_freelist) +
+ pool->watermark * 2 * sizeof(unsigned),
+ sizeof(unsigned));
+ if (!pool->tag_cpu)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tag_pool_init);
--
1.8.2.1

2013-05-14 01:21:47

by Kent Overstreet

Subject: [PATCH 16/21] mtip32xx: convert to batch completion

[[email protected]:
* changes for conversion to bio batch completion from Kent
* fix to apply the above changes cleanly on latest mtip32xx code
* batch bio completion changes in
* mtip_command_cleanup()
* mtip_timeout_function()
* mtip_handle_tfe()]
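
The conversion follows a pattern visible throughout the diff:
completions are accumulated into a batch and flushed once at the end,
instead of calling the old per-command async callback. A sketch of the
pattern, pulled out of the diff below:

static void complete_failed_bio_sketch(struct bio *bio, int err)
{
	struct batch_complete batch;

	batch_complete_init(&batch);

	/* any number of bios can be completed into the same batch */
	bio_endio_batch(bio, err, &batch);

	/* deliver all accumulated completions (and aio events) in one go */
	batch_complete(&batch);
}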

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Asai Thambi S P <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/mtip32xx/mtip32xx.c | 86 ++++++++++++++++++++++-----------------
drivers/block/mtip32xx/mtip32xx.h | 8 ++--
2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 847107e..1262321 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -151,6 +151,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
struct mtip_cmd *command;
struct mtip_port *port = dd->port;
static int in_progress;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (in_progress)
return;
@@ -166,11 +169,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
command = &port->commands[commandindex];

if (atomic_read(&command->active)
- && (command->async_callback)) {
- command->async_callback(command->async_data,
- -ENODEV);
- command->async_callback = NULL;
- command->async_data = NULL;
+ && (command->bio)) {
+ bio_endio_batch(command->bio, -ENODEV, &batch);
+ command->bio = NULL;
}

dma_unmap_sg(&port->dd->pdev->dev,
@@ -178,9 +179,10 @@ static void mtip_command_cleanup(struct driver_data *dd)
command->scatter_ents,
command->direction);
}
+ up(&port->cmd_slot);
}

- up(&port->cmd_slot);
+ batch_complete(&batch);

set_bit(MTIP_DDF_CLEANUP_BIT, &dd->dd_flag);
in_progress = 0;
@@ -580,6 +582,9 @@ static void mtip_timeout_function(unsigned long int data)
unsigned int bit, group;
unsigned int num_command_slots;
unsigned long to, tagaccum[SLOTBITS_IN_LONGS];
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (unlikely(!port))
return;
@@ -622,11 +627,9 @@ static void mtip_timeout_function(unsigned long int data)
writel(1 << bit, port->completed[group]);

/* Call the async completion callback. */
- if (likely(command->async_callback))
- command->async_callback(command->async_data,
- -EIO);
- command->async_callback = NULL;
- command->comp_func = NULL;
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, -EIO, &batch);
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&port->dd->pdev->dev,
@@ -645,6 +648,8 @@ static void mtip_timeout_function(unsigned long int data)
}
}

+ batch_complete(&batch);
+
if (cmdto_cnt) {
print_tags(port->dd, "timed out", tagaccum, cmdto_cnt);
if (!test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags)) {
@@ -695,7 +700,8 @@ static void mtip_timeout_function(unsigned long int data)
static void mtip_async_complete(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
struct mtip_cmd *command;
struct driver_data *dd = data;
@@ -712,11 +718,10 @@ static void mtip_async_complete(struct mtip_port *port,
}

/* Upper layer callback */
- if (likely(command->async_callback))
- command->async_callback(command->async_data, cb_status);
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, cb_status, batch);

- command->async_callback = NULL;
- command->comp_func = NULL;
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&dd->pdev->dev,
@@ -752,24 +757,22 @@ static void mtip_async_complete(struct mtip_port *port,
static void mtip_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
- struct mtip_cmd *command = &port->commands[tag];
struct completion *waiting = data;
if (unlikely(status == PORT_IRQ_TF_ERR))
dev_warn(&port->dd->pdev->dev,
"Internal command %d completed with TFE\n", tag);

- command->async_callback = NULL;
- command->comp_func = NULL;
-
complete(waiting);
}

static void mtip_null_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
return;
}
@@ -798,6 +801,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
unsigned char *buf;
char *fail_reason = NULL;
int fail_all_ncq_write = 0, fail_all_ncq_cmds = 0;
+ struct batch_complete batch;

dev_warn(&dd->pdev->dev, "Taskfile error\n");

@@ -815,13 +819,14 @@ static void mtip_handle_tfe(struct driver_data *dd)
atomic_inc(&cmd->active); /* active > 1 indicates error */
if (cmd->comp_data && cmd->comp_func) {
cmd->comp_func(port, MTIP_TAG_INTERNAL,
- cmd->comp_data, PORT_IRQ_TF_ERR);
+ cmd->comp_data, PORT_IRQ_TF_ERR, NULL);
}
goto handle_tfe_exit;
}

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -848,7 +853,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
cmd->comp_func(port,
tag,
cmd->comp_data,
- 0);
+ 0, &batch);
} else {
dev_err(&port->dd->pdev->dev,
"Missing completion func for tag %d",
@@ -861,6 +866,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
}
}
}
+ batch_complete(&batch);

print_tags(dd, "completed (TFE)", tagaccum, cmd_cnt);

@@ -902,6 +908,7 @@ static void mtip_handle_tfe(struct driver_data *dd)

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -935,7 +942,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
if (cmd->comp_func) {
cmd->comp_func(port, tag,
cmd->comp_data,
- -ENODATA);
+ -ENODATA, &batch);
}
continue;
}
@@ -965,13 +972,15 @@ static void mtip_handle_tfe(struct driver_data *dd)
port,
tag,
cmd->comp_data,
- PORT_IRQ_TF_ERR);
+ PORT_IRQ_TF_ERR, &batch);
else
dev_warn(&port->dd->pdev->dev,
"Bad completion for tag %d\n",
tag);
}
}
+
+ batch_complete(&batch);
print_tags(dd, "reissued (TFE)", tagaccum, cmd_cnt);

handle_tfe_exit:
@@ -992,6 +1001,9 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
struct driver_data *dd = port->dd;
int tag, bit;
struct mtip_cmd *command;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (!completed) {
WARN_ON_ONCE(!completed);
@@ -1016,7 +1028,8 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
port,
tag,
command->comp_data,
- 0);
+ 0,
+ &batch);
} else {
dev_warn(&dd->pdev->dev,
"Null completion "
@@ -1026,13 +1039,16 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
if (mtip_check_surprise_removal(
dd->pdev)) {
mtip_command_cleanup(dd);
- return;
+ goto out;
}
}
}
completed >>= 1;
}

+out:
+ batch_complete(&batch);
+
/* If last, re-enable interrupts */
if (atomic_dec_return(&dd->irq_workers_active) == 0)
writel(0xffffffff, dd->mmio + HOST_IRQ_STAT);
@@ -1053,7 +1069,7 @@ static inline void mtip_process_legacy(struct driver_data *dd, u32 port_stat)
cmd->comp_func(port,
MTIP_TAG_INTERNAL,
cmd->comp_data,
- 0);
+ 0, NULL);
return;
}
}
@@ -2561,8 +2577,8 @@ static int mtip_hw_ioctl(struct driver_data *dd, unsigned int cmd,
* None
*/
static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
- int nsect, int nents, int tag, void *callback,
- void *data, int dir, int unaligned)
+ int nsect, int nents, int tag,
+ struct bio *bio, int dir, int unaligned)
{
struct host_to_dev_fis *fis;
struct mtip_port *port = dd->port;
@@ -2621,12 +2637,7 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
command->comp_func = mtip_async_complete;
command->direction = dma_dir;

- /*
- * Set the completion function and data for the command passed
- * from the upper layer.
- */
- command->async_data = data;
- command->async_callback = callback;
+ command->bio = bio;

/*
* To prevent this command from being issued
@@ -3934,7 +3945,6 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
bio_sectors(bio),
nents,
tag,
- bio_endio,
bio,
bio_data_dir(bio),
unaligned);
diff --git a/drivers/block/mtip32xx/mtip32xx.h b/drivers/block/mtip32xx/mtip32xx.h
index 3bb8a29..7a2ddfd 100644
--- a/drivers/block/mtip32xx/mtip32xx.h
+++ b/drivers/block/mtip32xx/mtip32xx.h
@@ -328,11 +328,9 @@ struct mtip_cmd {
void (*comp_func)(struct mtip_port *port,
int tag,
void *data,
- int status);
- /* Additional callback function that may be called by comp_func() */
- void (*async_callback)(void *data, int status);
-
- void *async_data; /* Addl. data passed to async_callback() */
+ int status,
+ struct batch_complete *batch);
+ struct bio *bio;

int scatter_ents; /* Number of scatter list entries used */

--
1.8.2.1

2013-05-14 01:22:08

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 13/21] block: prep work for batch completion

Add a struct batch_complete * argument to bi_end_io; infrastructure to
make use of it comes in the next patch.
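
For illustration only (not part of the patch): a converted handler just
grows the extra parameter, and handlers that do not batch can ignore it,
since plain bio_endio() passes NULL for it.  The my_end_io/my_io names
below are made up; everything else comes from this series (struct
batch_complete itself is introduced by these patches):

#include <linux/bio.h>
#include <linux/completion.h>

/* hypothetical per-request state, for the example only */
struct my_io {
	struct completion done;
};

/*
 * bi_end_io callbacks now look like:
 *	void (*bi_end_io)(struct bio *, int, struct batch_complete *);
 */
static void my_end_io(struct bio *bio, int error,
		      struct batch_complete *batch)
{
	struct my_io *io = bio->bi_private;

	/* @batch is NULL when the bio is completed via plain bio_endio() */
	if (error)
		clear_bit(BIO_UPTODATE, &bio->bi_flags);

	complete(&io->done);
	bio_put(bio);
}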

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
block/blk-flush.c | 3 ++-
block/blk-lib.c | 3 ++-
drivers/block/drbd/drbd_bitmap.c | 3 ++-
drivers/block/drbd/drbd_worker.c | 9 ++++++---
drivers/block/drbd/drbd_wrappers.h | 9 ++++++---
drivers/block/floppy.c | 3 ++-
drivers/block/pktcdvd.c | 9 ++++++---
drivers/block/xen-blkback/blkback.c | 3 ++-
drivers/md/bcache/alloc.c | 3 ++-
drivers/md/bcache/btree.c | 3 ++-
drivers/md/bcache/debug.c | 3 ++-
drivers/md/bcache/io.c | 6 ++++--
drivers/md/bcache/journal.c | 9 ++++++---
drivers/md/bcache/movinggc.c | 3 ++-
drivers/md/bcache/request.c | 9 ++++++---
drivers/md/bcache/request.h | 3 +--
drivers/md/bcache/super.c | 11 +++++++----
drivers/md/bcache/writeback.c | 8 +++++---
drivers/md/dm-bufio.c | 9 +++++----
drivers/md/dm-cache-target.c | 3 ++-
drivers/md/dm-crypt.c | 3 ++-
drivers/md/dm-io.c | 2 +-
drivers/md/dm-snap.c | 3 ++-
drivers/md/dm-thin.c | 3 ++-
drivers/md/dm-verity.c | 3 ++-
drivers/md/dm.c | 6 ++++--
drivers/md/faulty.c | 3 ++-
drivers/md/md.c | 9 ++++++---
drivers/md/multipath.c | 3 ++-
drivers/md/raid1.c | 12 ++++++++----
drivers/md/raid10.c | 18 ++++++++++++------
drivers/md/raid5.c | 15 ++++++++++-----
drivers/target/target_core_iblock.c | 6 ++++--
drivers/target/target_core_pscsi.c | 3 ++-
fs/bio-integrity.c | 3 ++-
fs/bio.c | 17 +++++++++++------
fs/btrfs/check-integrity.c | 14 +++++++++-----
fs/btrfs/compression.c | 6 ++++--
fs/btrfs/disk-io.c | 6 ++++--
fs/btrfs/extent_io.c | 12 ++++++++----
fs/btrfs/inode.c | 13 ++++++++-----
fs/btrfs/raid56.c | 9 ++++++---
fs/btrfs/scrub.c | 18 ++++++++++++------
fs/btrfs/volumes.c | 5 +++--
fs/buffer.c | 3 ++-
fs/direct-io.c | 9 +++------
fs/ext4/page-io.c | 3 ++-
fs/f2fs/data.c | 2 +-
fs/f2fs/segment.c | 3 ++-
fs/gfs2/lops.c | 3 ++-
fs/gfs2/ops_fstype.c | 3 ++-
fs/hfsplus/wrapper.c | 3 ++-
fs/jfs/jfs_logmgr.c | 4 ++--
fs/jfs/jfs_metapage.c | 6 ++++--
fs/logfs/dev_bdev.c | 8 +++++---
fs/mpage.c | 2 +-
fs/nfs/blocklayout/blocklayout.c | 17 ++++++++++-------
fs/nilfs2/segbuf.c | 3 ++-
fs/ocfs2/cluster/heartbeat.c | 4 ++--
fs/xfs/xfs_aops.c | 3 ++-
fs/xfs/xfs_buf.c | 3 ++-
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 3 ++-
include/linux/fs.h | 2 +-
include/linux/swap.h | 9 ++++++---
mm/bounce.c | 12 ++++++++----
mm/page_io.c | 8 +++++---
67 files changed, 267 insertions(+), 152 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index cc2b827..762cfca 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -384,7 +384,8 @@ void blk_abort_flushes(struct request_queue *q)
}
}

-static void bio_end_flush(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index d6f50d5..279f9de 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -15,7 +15,8 @@ struct bio_batch {
struct completion *wait;
};

-static void bio_batch_end_io(struct bio *bio, int err)
+static void bio_batch_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_batch *bb = bio->bi_private;

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 64fbb83..046aa17 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -948,7 +948,8 @@ static void bm_aio_ctx_destroy(struct kref *kref)
}

/* bv_page may be a copy, or may be the original */
-static void bm_async_io_complete(struct bio *bio, int error)
+static void bm_async_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bm_aio_ctx *ctx = bio->bi_private;
struct drbd_conf *mdev = ctx->mdev;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 891c0ec..04a80af 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -64,7 +64,8 @@ rwlock_t global_state_lock;
/* used for synchronous meta data and bitmap IO
* submitted by drbd_md_sync_page_io()
*/
-void drbd_md_io_complete(struct bio *bio, int error)
+void drbd_md_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct drbd_md_io *md_io;
struct drbd_conf *mdev;
@@ -167,7 +168,8 @@ static void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req) __rel
/* writes on behalf of the partner, or resync writes,
* "submitted" by the receiver.
*/
-void drbd_peer_request_endio(struct bio *bio, int error)
+void drbd_peer_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct drbd_peer_request *peer_req = bio->bi_private;
struct drbd_conf *mdev = peer_req->w.mdev;
@@ -203,7 +205,8 @@ void drbd_peer_request_endio(struct bio *bio, int error)

/* read, readA or write requests on R_PRIMARY coming from drbd_make_request
*/
-void drbd_request_endio(struct bio *bio, int error)
+void drbd_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
unsigned long flags;
struct drbd_request *req = bio->bi_private;
diff --git a/drivers/block/drbd/drbd_wrappers.h b/drivers/block/drbd/drbd_wrappers.h
index 328f18e..d443dc0 100644
--- a/drivers/block/drbd/drbd_wrappers.h
+++ b/drivers/block/drbd/drbd_wrappers.h
@@ -20,9 +20,12 @@ static inline void drbd_set_my_capacity(struct drbd_conf *mdev,
#define drbd_bio_uptodate(bio) bio_flagged(bio, BIO_UPTODATE)

/* bi_end_io handlers */
-extern void drbd_md_io_complete(struct bio *bio, int error);
-extern void drbd_peer_request_endio(struct bio *bio, int error);
-extern void drbd_request_endio(struct bio *bio, int error);
+extern void drbd_md_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_peer_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);

/*
* used to submit our private bio
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 04ceb7e..d528753 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3746,7 +3746,8 @@ static unsigned int floppy_check_events(struct gendisk *disk,
* a disk in the drive, and whether that disk is writable.
*/

-static void floppy_rb0_complete(struct bio *bio, int err)
+static void floppy_rb0_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 3c08983..898fa74 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -980,7 +980,8 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
}
}

-static void pkt_end_io_read(struct bio *bio, int err)
+static void pkt_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -998,7 +999,8 @@ static void pkt_end_io_read(struct bio *bio, int err)
pkt_bio_finished(pd);
}

-static void pkt_end_io_packet_write(struct bio *bio, int err)
+static void pkt_end_io_packet_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -2337,7 +2339,8 @@ static void pkt_close(struct gendisk *disk, fmode_t mode)
}


-static void pkt_end_io_read_cloned(struct bio *bio, int err)
+static void pkt_end_io_read_cloned(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_stacked_data *psd = bio->bi_private;
struct pktcdvd_device *pd = psd->pd;
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index dd5b2fe..990c1d8 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -741,7 +741,8 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
/*
* bio callback.
*/
-static void end_block_io_op(struct bio *bio, int error)
+static void end_block_io_op(struct bio *bio, int error,
+ struct batch_complete *batch)
{
__end_block_io_op(bio->bi_private, error);
bio_put(bio);
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 048f294..1f75edd 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -156,7 +156,8 @@ static void discard_finish(struct work_struct *w)
closure_put(&ca->set->cl);
}

-static void discard_endio(struct bio *bio, int error)
+static void discard_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct discard *d = container_of(bio, struct discard, bio);
schedule_work(&d->work);
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 7a5658f..36688d6 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -134,7 +134,8 @@ static uint64_t btree_csum_set(struct btree *b, struct bset *i)
return crc ^ 0xffffffffffffffffULL;
}

-static void btree_bio_endio(struct bio *bio, int error)
+static void btree_bio_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct btree *b = container_of(cl, struct btree, io.cl);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 89fd520..3a32b06 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -177,7 +177,8 @@ void bch_btree_verify(struct btree *b, struct bset *new)
mutex_unlock(&b->c->verify_lock);
}

-static void data_verify_endio(struct bio *bio, int error)
+static void data_verify_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
closure_put(cl);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 48efd4d..29f344b 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -9,7 +9,8 @@
#include "bset.h"
#include "debug.h"

-static void bch_bi_idx_hack_endio(struct bio *bio, int error)
+static void bch_bi_idx_hack_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio *p = bio->bi_private;

@@ -206,7 +207,8 @@ static void bch_bio_submit_split_done(struct closure *cl)
mempool_free(s, s->p->bio_split_hook);
}

-static void bch_bio_submit_split_endio(struct bio *bio, int error)
+static void bch_bio_submit_split_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 8c8dfdc..bff194b 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -22,7 +22,8 @@
* bit.
*/

-static void journal_read_endio(struct bio *bio, int error)
+static void journal_read_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
closure_put(cl);
@@ -390,7 +391,8 @@ found:

#define last_seq(j) ((j)->seq - fifo_used(&(j)->pin) + 1)

-static void journal_discard_endio(struct bio *bio, int error)
+static void journal_discard_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct journal_device *ja =
container_of(bio, struct journal_device, discard_bio);
@@ -535,7 +537,8 @@ void bch_journal_next(struct journal *j)
pr_debug("journal_pin full (%zu)", fifo_used(&j->pin));
}

-static void journal_write_endio(struct bio *bio, int error)
+static void journal_write_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct journal_write *w = bio->bi_private;

diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index 8589512..8bf7ae1 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -61,7 +61,8 @@ static void write_moving_finish(struct closure *cl)
closure_return_with_destructor(cl, moving_io_destructor);
}

-static void read_moving_endio(struct bio *bio, int error)
+static void read_moving_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct moving_io *io = container_of(bio->bi_private,
struct moving_io, s.cl);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index e5ff12e..bc837ed 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -456,7 +456,8 @@ static void bch_insert_data_error(struct closure *cl)
bch_journal(cl);
}

-static void bch_insert_data_endio(struct bio *bio, int error)
+static void bch_insert_data_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct btree_op *op = container_of(cl, struct btree_op, cl);
@@ -621,7 +622,8 @@ void bch_btree_insert_async(struct closure *cl)

/* Common code for the make_request functions */

-static void request_endio(struct bio *bio, int error)
+static void request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;

@@ -636,7 +638,8 @@ static void request_endio(struct bio *bio, int error)
closure_put(cl);
}

-void bch_cache_read_endio(struct bio *bio, int error)
+void bch_cache_read_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bbio *b = container_of(bio, struct bbio, bio);
struct closure *cl = bio->bi_private;
diff --git a/drivers/md/bcache/request.h b/drivers/md/bcache/request.h
index 254d9ab..3b79462 100644
--- a/drivers/md/bcache/request.h
+++ b/drivers/md/bcache/request.h
@@ -29,11 +29,10 @@ struct search {
struct btree_op op;
};

-void bch_cache_read_endio(struct bio *, int);
+void bch_cache_read_endio(struct bio *, int, struct batch_complete *batch);
int bch_get_congested(struct cache_set *);
void bch_insert_data(struct closure *cl);
void bch_btree_insert_async(struct closure *);
-void bch_cache_read_endio(struct bio *, int);

void bch_open_buckets_free(struct cache_set *);
int bch_open_buckets_alloc(struct cache_set *);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index c8046bc..76c7f6c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -224,7 +224,8 @@ err:
return err;
}

-static void write_bdev_super_endio(struct bio *bio, int error)
+static void write_bdev_super_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cached_dev *dc = bio->bi_private;
/* XXX: error checking */
@@ -285,7 +286,8 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
closure_return(cl);
}

-static void write_super_endio(struct bio *bio, int error)
+static void write_super_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cache *ca = bio->bi_private;

@@ -326,7 +328,7 @@ void bcache_write_super(struct cache_set *c)

/* UUID io */

-static void uuid_endio(struct bio *bio, int error)
+static void uuid_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct cache_set *c = container_of(cl, struct cache_set, uuid_write.cl);
@@ -490,7 +492,8 @@ static struct uuid_entry *uuid_find_empty(struct cache_set *c)
* disk.
*/

-static void prio_endio(struct bio *bio, int error)
+static void prio_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cache *ca = bio->bi_private;

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 93e7e31..daf9347 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -253,7 +253,8 @@ static void write_dirty_finish(struct closure *cl)
closure_return_with_destructor(cl, dirty_io_destructor);
}

-static void dirty_endio(struct bio *bio, int error)
+static void dirty_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
@@ -281,7 +282,8 @@ static void write_dirty(struct closure *cl)
continue_at(cl, write_dirty_finish, dirty_wq);
}

-static void read_dirty_endio(struct bio *bio, int error)
+static void read_dirty_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
@@ -289,7 +291,7 @@ static void read_dirty_endio(struct bio *bio, int error)
bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
error, "reading dirty data from cache");

- dirty_endio(bio, error);
+ dirty_endio(bio, error, NULL);
}

static void read_dirty_submit(struct closure *cl)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 0387e05..d489dfd 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -494,7 +494,7 @@ static void dmio_complete(unsigned long error, void *context)
{
struct dm_buffer *b = context;

- b->bio.bi_end_io(&b->bio, error ? -EIO : 0);
+ b->bio.bi_end_io(&b->bio, error ? -EIO : 0, NULL);
}

static void use_dmio(struct dm_buffer *b, int rw, sector_t block,
@@ -525,7 +525,7 @@ static void use_dmio(struct dm_buffer *b, int rw, sector_t block,

r = dm_io(&io_req, 1, &region, NULL);
if (r)
- end_io(&b->bio, r);
+ end_io(&b->bio, r, NULL);
}

static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
@@ -592,7 +592,8 @@ static void submit_io(struct dm_buffer *b, int rw, sector_t block,
* Set the error, clear B_WRITING bit and wake anyone who was waiting on
* it.
*/
-static void write_endio(struct bio *bio, int error)
+static void write_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

@@ -965,7 +966,7 @@ found_buffer:
* The endio routine for reading: set the error, clear the bit and wake up
* anyone waiting on the buffer.
*/
-static void read_endio(struct bio *bio, int error)
+static void read_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index df44b60..53fb7b2 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -653,7 +653,8 @@ static void defer_writethrough_bio(struct cache *cache, struct bio *bio)
wake_worker(cache);
}

-static void writethrough_endio(struct bio *bio, int err)
+static void writethrough_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct per_bio_data *pb = get_per_bio_data(bio, PB_DATA_SIZE_WT);
bio->bi_end_io = pb->saved_bi_end_io;
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 6d2d41a..ec0e3c0 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -929,7 +929,8 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
* The work is done per CPU global for all dm-crypt instances.
* They should not depend on each other and do not block.
*/
-static void crypt_endio(struct bio *clone, int error)
+static void crypt_endio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_crypt_io *io = clone->bi_private;
struct crypt_config *cc = io->cc;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea49834..a727b26 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -136,7 +136,7 @@ static void dec_count(struct io *io, unsigned int region, int error)
}
}

-static void endio(struct bio *bio, int error)
+static void endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct io *io;
unsigned region;
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c434e5a..fb3ea3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1486,7 +1486,8 @@ static void start_copy(struct dm_snap_pending_exception *pe)
dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
}

-static void full_bio_end_io(struct bio *bio, int error)
+static void full_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
void *callback_data = bio->bi_private;

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 759cffc..0390a03 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -553,7 +553,8 @@ static void copy_complete(int read_err, unsigned long write_err, void *context)
spin_unlock_irqrestore(&pool->lock, flags);
}

-static void overwrite_endio(struct bio *bio, int err)
+static void overwrite_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
unsigned long flags;
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index b948fd8..b373bb7 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -413,7 +413,8 @@ static void verity_work(struct work_struct *w)
verity_finish_io(io, verity_verify_io(io));
}

-static void verity_end_io(struct bio *bio, int error)
+static void verity_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_verity_io *io = bio->bi_private;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d5370a9..9101124 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -615,7 +615,8 @@ static void dec_pending(struct dm_io *io, int error)
}
}

-static void clone_endio(struct bio *bio, int error)
+static void clone_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int r = 0;
struct dm_target_io *tio = bio->bi_private;
@@ -650,7 +651,8 @@ static void clone_endio(struct bio *bio, int error)
/*
* Partial completion handling for request-based dm
*/
-static void end_clone_bio(struct bio *clone, int error)
+static void end_clone_bio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_rq_clone_bio_info *info = clone->bi_private;
struct dm_rq_target_io *tio = info->tio;
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 3193aef..ac8af52 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -70,7 +70,8 @@
#include <linux/seq_file.h>


-static void faulty_fail(struct bio *bio, int error)
+static void faulty_fail(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio *b = bio->bi_private;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 681d109..9a02686 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -379,7 +379,8 @@ EXPORT_SYMBOL(mddev_congested);
* Generic flush handling for md
*/

-static void md_end_flush(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -756,7 +757,8 @@ void md_rdev_clear(struct md_rdev *rdev)
}
EXPORT_SYMBOL_GPL(md_rdev_clear);

-static void super_written(struct bio *bio, int error)
+static void super_written(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -807,7 +809,8 @@ void md_super_wait(struct mddev *mddev)
finish_wait(&mddev->sb_wait, &wq);
}

-static void bi_complete(struct bio *bio, int error)
+static void bi_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
complete((struct completion*)bio->bi_private);
}
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 1642eae..fecad70 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -83,7 +83,8 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
mempool_free(mp_bh, conf->pool);
}

-static void multipath_end_request(struct bio *bio, int error)
+static void multipath_end_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct multipath_bh *mp_bh = bio->bi_private;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5595118..d55d0d9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -294,7 +294,8 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
return mirror;
}

-static void raid1_end_read_request(struct bio *bio, int error)
+static void raid1_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -379,7 +380,8 @@ static void r1_bio_write_done(struct r1bio *r1_bio)
}
}

-static void raid1_end_write_request(struct bio *bio, int error)
+static void raid1_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -1612,7 +1614,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r1bio *r1_bio = bio->bi_private;

@@ -1630,7 +1633,8 @@ static void end_sync_read(struct bio *bio, int error)
reschedule_retry(r1_bio);
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 59d4daa..6c63406 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -101,7 +101,8 @@ static int enough(struct r10conf *conf, int ignore);
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
int *skipped);
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
-static void end_reshape_write(struct bio *bio, int error);
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch);
static void end_reshape(struct r10conf *conf);

static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
@@ -358,7 +359,8 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
return r10_bio->devs[slot].devnum;
}

-static void raid10_end_read_request(struct bio *bio, int error)
+static void raid10_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -441,7 +443,8 @@ static void one_write_done(struct r10bio *r10_bio)
}
}

-static void raid10_end_write_request(struct bio *bio, int error)
+static void raid10_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -1912,7 +1915,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r10bio *r10_bio = bio->bi_private;
struct r10conf *conf = r10_bio->mddev->private;
@@ -1973,7 +1977,8 @@ static void end_sync_request(struct r10bio *r10_bio)
}
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -4598,7 +4603,8 @@ static int handle_reshape_read_error(struct mddev *mddev,
return 0;
}

-static void end_reshape_write(struct bio *bio, int error)
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9359828..d014b66 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -532,9 +532,11 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
}

static void
-raid5_end_read_request(struct bio *bi, int error);
+raid5_end_read_request(struct bio *bi, int error,
+ struct batch_complete *batch);
static void
-raid5_end_write_request(struct bio *bi, int error);
+raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch);

static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
{
@@ -1713,7 +1715,8 @@ static void shrink_stripes(struct r5conf *conf)
conf->slab_cache = NULL;
}

-static void raid5_end_read_request(struct bio * bi, int error)
+static void raid5_end_read_request(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -1833,7 +1836,8 @@ static void raid5_end_read_request(struct bio * bi, int error)
release_stripe(sh);
}

-static void raid5_end_write_request(struct bio *bi, int error)
+static void raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -3904,7 +3908,8 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
* first).
* If the read failed..
*/
-static void raid5_align_endio(struct bio *bi, int error)
+static void raid5_align_endio(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct bio* raid_bi = bi->bi_private;
struct mddev *mddev;
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 07f5f94..4e842cb 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -271,7 +271,8 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
kfree(ibr);
}

-static void iblock_bio_done(struct bio *bio, int err)
+static void iblock_bio_done(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;
struct iblock_req *ibr = cmd->priv;
@@ -335,7 +336,8 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
blk_finish_plug(&plug);
}

-static void iblock_end_io_flush(struct bio *bio, int err)
+static void iblock_end_io_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;

diff --git a/drivers/target/target_core_pscsi.c b/drivers/target/target_core_pscsi.c
index e992b27..1e98731 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -835,7 +835,8 @@ static ssize_t pscsi_show_configfs_dev_params(struct se_device *dev, char *b)
return bl;
}

-static void pscsi_bi_endio(struct bio *bio, int error)
+static void pscsi_bi_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
bio_put(bio);
}
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 8fb4291..69f6f80 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -510,7 +510,8 @@ static void bio_integrity_verify_fn(struct work_struct *work)
* in process context. This function postpones completion
* accordingly.
*/
-void bio_integrity_endio(struct bio *bio, int error)
+void bio_integrity_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio_integrity_payload *bip = bio->bi_integrity;

diff --git a/fs/bio.c b/fs/bio.c
index 94bbc04..e082907 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -760,7 +760,8 @@ struct submit_bio_ret {
int error;
};

-static void submit_bio_wait_endio(struct bio *bio, int error)
+static void submit_bio_wait_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct submit_bio_ret *ret = bio->bi_private;

@@ -1414,7 +1415,8 @@ void bio_unmap_user(struct bio *bio)
}
EXPORT_SYMBOL(bio_unmap_user);

-static void bio_map_kern_endio(struct bio *bio, int err)
+static void bio_map_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bio_put(bio);
}
@@ -1486,7 +1488,8 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
}
EXPORT_SYMBOL(bio_map_kern);

-static void bio_copy_kern_endio(struct bio *bio, int err)
+static void bio_copy_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec;
const int read = bio_data_dir(bio) == READ;
@@ -1707,7 +1710,7 @@ void bio_endio(struct bio *bio, int error)
error = -EIO;

if (bio->bi_end_io)
- bio->bi_end_io(bio, error);
+ bio->bi_end_io(bio, error, NULL);
}
EXPORT_SYMBOL(bio_endio);

@@ -1722,7 +1725,8 @@ void bio_pair_release(struct bio_pair *bp)
}
EXPORT_SYMBOL(bio_pair_release);

-static void bio_pair_end_1(struct bio *bi, int err)
+static void bio_pair_end_1(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);

@@ -1732,7 +1736,8 @@ static void bio_pair_end_1(struct bio *bi, int err)
bio_pair_release(bp);
}

-static void bio_pair_end_2(struct bio *bi, int err)
+static void bio_pair_end_2(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 18af6f4..3c617b3 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -323,7 +323,8 @@ static void btrfsic_release_block_ctx(struct btrfsic_block_data_ctx *block_ctx);
static int btrfsic_read_block(struct btrfsic_state *state,
struct btrfsic_block_data_ctx *block_ctx);
static void btrfsic_dump_database(struct btrfsic_state *state);
-static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int btrfsic_test_for_metadata(struct btrfsic_state *state,
char **datav, unsigned int num_pages);
static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
@@ -336,7 +337,8 @@ static int btrfsic_process_written_superblock(
struct btrfsic_state *state,
struct btrfsic_block *const block,
struct btrfs_super_block *const super_hdr);
-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status);
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch);
static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate);
static int btrfsic_is_block_ref_by_superblock(const struct btrfsic_state *state,
const struct btrfsic_block *block,
@@ -1751,7 +1753,8 @@ static int btrfsic_read_block(struct btrfsic_state *state,
return block_ctx->len;
}

-static void btrfsic_complete_bio_end_io(struct bio *bio, int err)
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -2294,7 +2297,8 @@ continue_loop:
goto again;
}

-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch)
{
struct btrfsic_block *block = (struct btrfsic_block *)bp->bi_private;
int iodone_w_error;
@@ -2342,7 +2346,7 @@ static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
block = next_block;
} while (NULL != block);

- bp->bi_end_io(bp, bio_error_status);
+ bp->bi_end_io(bp, bio_error_status, batch);
}

static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b189bd1..2298567 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -156,7 +156,8 @@ fail:
* The compressed pages are freed here, and it must be run
* in process context
*/
-static void end_compressed_bio_read(struct bio *bio, int err)
+static void end_compressed_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct compressed_bio *cb = bio->bi_private;
struct inode *inode;
@@ -266,7 +267,8 @@ static noinline void end_compressed_writeback(struct inode *inode, u64 start,
* This also calls the writeback end hooks for the file pages so that
* metadata and checksums can be updated in the file.
*/
-static void end_compressed_bio_write(struct bio *bio, int err)
+static void end_compressed_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct extent_io_tree *tree;
struct compressed_bio *cb = bio->bi_private;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4e9ebe1..4166099 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -685,7 +685,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
return -EIO; /* we fixed nothing */
}

-static void end_workqueue_bio(struct bio *bio, int err)
+static void end_workqueue_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct end_io_wq *end_io_wq = bio->bi_private;
struct btrfs_fs_info *fs_info;
@@ -3072,7 +3073,8 @@ static int write_dev_supers(struct btrfs_device *device,
* endio for the write_dev_flush, this will wake anyone waiting
* for the barrier when it is done
*/
-static void btrfs_end_empty_barrier(struct bio *bio, int err)
+static void btrfs_end_empty_barrier(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err) {
if (err == -EOPNOTSUPP)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 32d67a8..84d8b4d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2012,7 +2012,8 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
return err;
}

-static void repair_io_failure_callback(struct bio *bio, int err)
+static void repair_io_failure_callback(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete(bio->bi_private);
}
@@ -2392,7 +2393,8 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_writepage(struct bio *bio, int err)
+static void end_bio_extent_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct extent_io_tree *tree;
@@ -2438,7 +2440,8 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_readpage(struct bio *bio, int err)
+static void end_bio_extent_readpage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -3262,7 +3265,8 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
}

-static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
+static void end_bio_extent_buffer_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = err == 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b31b3b..551c8bd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6931,7 +6931,8 @@ struct btrfs_dio_private {
struct bio *orig_bio;
};

-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static void btrfs_endio_direct_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -6984,10 +6985,11 @@ failed:
/* If we had a csum failure make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

-static void btrfs_endio_direct_write(struct bio *bio, int err)
+static void btrfs_endio_direct_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct inode *inode = dip->inode;
@@ -7029,7 +7031,7 @@ out_done:
/* If we had an error make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
@@ -7043,7 +7045,8 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
return 0;
}

-static void btrfs_end_dio_bio(struct bio *bio, int err)
+static void btrfs_end_dio_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0740621..6927575 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -850,7 +850,8 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, int err, int uptodate)
* end io function used by finish_rmw. When we finally
* get here, we've written a full stripe
*/
-static void raid_write_end_io(struct bio *bio, int err)
+static void raid_write_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

@@ -1384,7 +1385,8 @@ static void set_bio_pages_uptodate(struct bio *bio)
* This will usually kick off finish_rmw once all the bios are read in, but it
* may trigger parity reconstruction if we had any errors along the way
*/
-static void raid_rmw_end_io(struct bio *bio, int err)
+static void raid_rmw_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

@@ -1905,7 +1907,8 @@ cleanup_io:
* This is called only for stripes we've read from disk to
* reconstruct the parity.
*/
-static void raid_recover_end_io(struct bio *bio, int err)
+static void raid_recover_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f489e24..ac4a48f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -200,7 +200,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
int is_metadata, int have_csum,
const u8 *csum, u64 generation,
u16 csum_size);
-static void scrub_complete_bio_end_io(struct bio *bio, int err);
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int scrub_repair_block_from_good_copy(struct scrub_block *sblock_bad,
struct scrub_block *sblock_good,
int force_write);
@@ -223,7 +224,8 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
u64 physical, struct btrfs_device *dev, u64 flags,
u64 gen, int mirror_num, u8 *csum, int force,
u64 physical_for_dev_replace);
-static void scrub_bio_end_io(struct bio *bio, int err);
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_bio_end_io_worker(struct btrfs_work *work);
static void scrub_block_complete(struct scrub_block *sblock);
static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
@@ -240,7 +242,8 @@ static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx);
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage);
static void scrub_wr_submit(struct scrub_ctx *sctx);
-static void scrub_wr_bio_end_io(struct bio *bio, int err);
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_wr_bio_end_io_worker(struct btrfs_work *work);
static int write_page_nocow(struct scrub_ctx *sctx,
u64 physical_for_dev_replace, struct page *page);
@@ -1384,7 +1387,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
sblock->checksum_error = 1;
}

-static void scrub_complete_bio_end_io(struct bio *bio, int err)
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -1584,7 +1588,8 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
btrfsic_submit_bio(WRITE, sbio->bio);
}

-static void scrub_wr_bio_end_io(struct bio *bio, int err)
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
@@ -2053,7 +2058,8 @@ leave_nomem:
return 0;
}

-static void scrub_bio_end_io(struct bio *bio, int err)
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e925ce..7299b55 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5044,7 +5044,8 @@ static unsigned int extract_stripe_index_from_bio_private(void *bi_private)
return (unsigned int)((uintptr_t)bi_private) & 3;
}

-static void btrfs_end_bio(struct bio *bio, int err)
+static void btrfs_end_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_bio *bbio = extract_bbio_from_bio_private(bio->bi_private);
int is_orig_bio = 0;
@@ -5101,7 +5102,7 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
kfree(bbio);

- bio_endio(bio, err);
+ bio_endio_batch(bio, err, batch);
} else if (!is_orig_bio) {
bio_put(bio);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index d2a4d1b..c410422 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2882,7 +2882,8 @@ sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
}
EXPORT_SYMBOL(generic_block_bmap);

-static void end_bio_bh_io_sync(struct bio *bio, int err)
+static void end_bio_bh_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct buffer_head *bh = bio->bi_private;

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..331fd5c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -324,12 +324,12 @@ static void dio_bio_end_io(struct bio *bio, int error)
* so that the DIO specific endio actions are dealt with after the filesystem
* has done it's completion work.
*/
-void dio_end_io(struct bio *bio, int error)
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;

if (dio->is_async)
- dio_bio_end_aio(bio, error);
+ dio_bio_end_aio(bio, error, batch);
else
dio_bio_end_io(bio, error);
}
@@ -350,10 +350,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,

bio->bi_bdev = bdev;
bio->bi_sector = first_sector;
- if (dio->is_async)
- bio->bi_end_io = dio_bio_end_aio;
- else
- bio->bi_end_io = dio_bio_end_io;
+ bio->bi_end_io = dio_end_io;

sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 19599bd..0f56709 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -258,7 +258,8 @@ static void buffer_io_error(struct buffer_head *bh)
(unsigned long long)bh->b_blocknr);
}

-static void ext4_end_bio(struct bio *bio, int error)
+static void ext4_end_bio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
ext4_io_end_t *io_end = bio->bi_private;
struct inode *inode;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 91ff93b..454fca9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -330,7 +330,7 @@ repeat:
return page;
}

-static void read_end_io(struct bio *bio, int err)
+static void read_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index d8e84e4..e36793c 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -633,7 +633,8 @@ static const struct segment_allocation default_salloc_ops = {
.allocate_segment = allocate_segment_by_default,
};

-static void f2fs_end_io_write(struct bio *bio, int err)
+static void f2fs_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index c5fa758..91a5ebb 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -200,7 +200,8 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
*
*/

-static void gfs2_end_log_write(struct bio *bio, int error)
+static void gfs2_end_log_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct gfs2_sbd *sdp = bio->bi_private;
struct bio_vec *bvec;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 60ede2a..86eb657 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -155,7 +155,8 @@ static int gfs2_check_sb(struct gfs2_sbd *sdp, int silent)
return -EINVAL;
}

-static void end_bio_io_page(struct bio *bio, int error)
+static void end_bio_io_page(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/hfsplus/wrapper.c b/fs/hfsplus/wrapper.c
index b51a607..96375a5 100644
--- a/fs/hfsplus/wrapper.c
+++ b/fs/hfsplus/wrapper.c
@@ -24,7 +24,8 @@ struct hfsplus_wd {
u16 embed_count;
};

-static void hfsplus_end_io_sync(struct bio *bio, int err)
+static void hfsplus_end_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index c57499d..b02926c 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2153,7 +2153,7 @@ static void lbmStartIO(struct lbuf * bp)
/* check if journaling to disk has been disabled */
if (log->no_integrity) {
bio->bi_size = 0;
- lbmIODone(bio, 0);
+ lbmIODone(bio, 0, NULL);
} else {
submit_bio(WRITE_SYNC, bio);
INCREMENT(lmStat.submitted);
@@ -2191,7 +2191,7 @@ static int lbmIOWait(struct lbuf * bp, int flag)
*
* executed at INTIODONE level
*/
-static void lbmIODone(struct bio *bio, int error)
+static void lbmIODone(struct bio *bio, int error, struct batch_complete *batch)
{
struct lbuf *bp = bio->bi_private;
struct lbuf *nextbp, *tail;
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 6740d34..6ba6757 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -283,7 +283,8 @@ static void last_read_complete(struct page *page)
unlock_page(page);
}

-static void metapage_read_end_io(struct bio *bio, int err)
+static void metapage_read_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

@@ -338,7 +339,8 @@ static void last_write_complete(struct page *page)
end_page_writeback(page);
}

-static void metapage_write_end_io(struct bio *bio, int err)
+static void metapage_write_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 550475c..0ae2254 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -14,7 +14,8 @@

#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))

-static void request_complete(struct bio *bio, int err)
+static void request_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -64,7 +65,8 @@ static int bdev_readpage(void *_sb, struct page *page)

static DECLARE_WAIT_QUEUE_HEAD(wq);

-static void writeseg_end_io(struct bio *bio, int err)
+static void writeseg_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -168,7 +170,7 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
}


-static void erase_end_io(struct bio *bio, int err)
+static void erase_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct super_block *sb = bio->bi_private;
diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..a4089bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -41,7 +41,7 @@
* status of that page is hard. See end_buffer_async_read() for the details.
* There is no point in duplicating all that complexity.
*/
-static void mpage_end_io(struct bio *bio, int err)
+static void mpage_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 434b93e..76cf695 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -143,7 +143,7 @@ bl_submit_bio(int rw, struct bio *bio)

static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
struct bio *bio;
@@ -167,7 +167,7 @@ static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
static struct bio *do_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par,
unsigned int offset, int len)
{
@@ -190,7 +190,7 @@ retry:
static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
return do_add_page_to_bio(bio, npg, rw, isect, page, be,
@@ -198,7 +198,8 @@ static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
}

/* This is basically copied from mpage_end_io_read */
-static void bl_end_io_read(struct bio *bio, int err)
+static void bl_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -380,7 +381,8 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
}
}

-static void bl_end_io_write_zero(struct bio *bio, int err)
+static void bl_end_io_write_zero(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -408,7 +410,8 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
put_parallel(par);
}

-static void bl_end_io_write(struct bio *bio, int err)
+static void bl_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -487,7 +490,7 @@ map_block(struct buffer_head *bh, sector_t isect, struct pnfs_block_extent *be)
}

static void
-bl_read_single_end_io(struct bio *bio, int error)
+bl_read_single_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct page *page = bvec->bv_page;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc9a913..680b65b 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -338,7 +338,8 @@ void nilfs_add_checksums_on_logs(struct list_head *logs, u32 seed)
/*
* BIO operations
*/
-static void nilfs_end_bio_write(struct bio *bio, int err)
+static void nilfs_end_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct nilfs_segment_buffer *segbuf = bio->bi_private;
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 42252bf..73ed9d6 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -380,8 +380,8 @@ static void o2hb_wait_on_io(struct o2hb_region *reg,
wait_for_completion(&wc->wc_io_complete);
}

-static void o2hb_bio_end_io(struct bio *bio,
- int error)
+static void o2hb_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct o2hb_bio_wait_ctxt *wc = bio->bi_private;

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2b2691b..f64ee71 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -380,7 +380,8 @@ xfs_imap_valid(
STATIC void
xfs_end_bio(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_ioend_t *ioend = bio->bi_private;

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 82b70bd..cee0e42 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1224,7 +1224,8 @@ _xfs_buf_ioend(
STATIC void
xfs_buf_bio_end_io(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_buf_t *bp = (xfs_buf_t *)bio->bi_private;

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ef24466..7f3089f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -580,7 +580,7 @@ extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_prep(struct bio *);
-extern void bio_integrity_endio(struct bio *, int);
+extern void bio_integrity_endio(struct bio *, int, struct batch_complete *);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fa1abeb..b3195e3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -16,7 +16,8 @@ struct page;
struct block_device;
struct io_context;
struct cgroup_subsys_state;
-typedef void (bio_end_io_t) (struct bio *, int);
+struct batch_complete;
+typedef void (bio_end_io_t) (struct bio *, int, struct batch_complete *);
typedef void (bio_destructor_t) (struct bio *);

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..0a9a6766 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2453,7 +2453,7 @@ enum {
DIO_SKIP_HOLES = 0x02,
};

-void dio_end_io(struct bio *bio, int error);
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch);

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..ca031f7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -330,11 +330,14 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
/* linux/mm/page_io.c */
extern int swap_readpage(struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio, int err);
+extern void end_swap_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch);
extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
- void (*end_write_func)(struct bio *, int));
+ void (*end_write_func)(struct bio *bio, int err,
+ struct batch_complete *batch));
extern int swap_set_page_dirty(struct page *page);
-extern void end_swap_bio_read(struct bio *bio, int err);
+extern void end_swap_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch);

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
diff --git a/mm/bounce.c b/mm/bounce.c
index c9f0a43..708c1e9 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -147,12 +147,14 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
bio_put(bio);
}

-static void bounce_end_io_write(struct bio *bio, int err)
+static void bounce_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bounce_end_io(bio, page_pool, err);
}

-static void bounce_end_io_write_isa(struct bio *bio, int err)
+static void bounce_end_io_write_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{

bounce_end_io(bio, isa_page_pool, err);
@@ -168,12 +170,14 @@ static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err)
bounce_end_io(bio, pool, err);
}

-static void bounce_end_io_read(struct bio *bio, int err)
+static void bounce_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, page_pool, err);
}

-static void bounce_end_io_read_isa(struct bio *bio, int err)
+static void bounce_end_io_read_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, isa_page_pool, err);
}
diff --git a/mm/page_io.c b/mm/page_io.c
index 3db0f5f..e39237d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -42,7 +42,8 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
return bio;
}

-void end_swap_bio_write(struct bio *bio, int err)
+void end_swap_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
@@ -68,7 +69,7 @@ void end_swap_bio_write(struct bio *bio, int err)
bio_put(bio);
}

-void end_swap_bio_read(struct bio *bio, int err)
+void end_swap_bio_read(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
@@ -203,7 +204,8 @@ out:
}

int __swap_writepage(struct page *page, struct writeback_control *wbc,
- void (*end_write_func)(struct bio *, int))
+ void (*end_write_func)(struct bio *bio, int err,
+ struct batch_complete *batch))
{
struct bio *bio;
int ret = 0, rw = WRITE;
--
1.8.2.1

2013-05-14 01:22:06

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 14/21] block, aio: batch completion for bios/kiocbs

When completing a kiocb, there's some fixed overhead from touching the
kioctx's ring buffer the kiocb belongs to. Some newer high-end block
devices can complete multiple IOs per interrupt, much as many network
interfaces have been doing for some time.

This plumbs through infrastructure so we can take advantage of multiple
completions at the interrupt level, and complete multiple kiocbs at the
same time.

Drivers have to be converted to take advantage of this, but it's a simple
change and the next patches will convert a few drivers.

To use it, an interrupt handler (or any code that completes bios or
requests) declares and initializes a struct batch_complete:

struct batch_complete batch;
batch_complete_init(&batch);

Then, instead of calling bio_endio(), it calls
bio_endio_batch(bio, err, &batch). This just adds the bio to a list in
the batch_complete.

At the end, it calls

batch_complete(&batch);

This completes all the bios in one pass, building up a list of kiocbs;
then that list of kiocbs is completed all at once.
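
Put together, a converted completion path ends up looking roughly like
this sketch (the mydrv_* names are hypothetical, only there to show the
shape of the API; the real conversions follow in the next patches):

    static void mydrv_complete_irq(struct mydrv_queue *q)
    {
        struct batch_complete batch;
        struct bio *bio;
        int err;

        batch_complete_init(&batch);

        /* defer each completion onto the batch's bio list */
        while (mydrv_next_completion(q, &bio, &err))
            bio_endio_batch(bio, err, &batch);

        /* run every bi_end_io, then complete all the kiocbs at once */
        batch_complete(&batch);
    }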

[[email protected]: fix warning]
[[email protected]: fs/aio.c needs bio.h, move bio_endio_batch() declaration somewhere rational]
[[email protected]: fix warnings]
[[email protected]: fix build error due to bio_endio_batch]
[[email protected]: fix tracepoint in batch_complete()]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
block/blk-core.c | 35 +++++---
block/blk-flush.c | 2 +-
block/blk.h | 3 +-
drivers/block/swim3.c | 2 +-
drivers/md/dm.c | 2 +-
fs/aio.c | 196 +++++++++++++++++++++++++----------------
fs/bio.c | 49 +++++++----
fs/direct-io.c | 12 +--
include/linux/aio.h | 24 ++++-
include/linux/batch_complete.h | 22 +++++
include/linux/bio.h | 36 ++++++--
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 12 ++-
13 files changed, 270 insertions(+), 126 deletions(-)
create mode 100644 include/linux/batch_complete.h

diff --git a/block/blk-core.c b/block/blk-core.c
index 33c33bc..94aa4e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -153,7 +153,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
EXPORT_SYMBOL(blk_rq_init);

static void req_bio_endio(struct request *rq, struct bio *bio,
- unsigned int nbytes, int error)
+ unsigned int nbytes, int error,
+ struct batch_complete *batch)
{
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -167,7 +168,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,

/* don't actually finish bio if it's part of flush sequence */
if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
- bio_endio(bio, error);
+ bio_endio_batch(bio, error, batch);
}

void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -2281,7 +2282,8 @@ EXPORT_SYMBOL(blk_fetch_request);
* %false - this request doesn't have any more data
* %true - this request has more data
**/
-bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
+bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
+ struct batch_complete *batch)
{
int total_bytes;

@@ -2337,7 +2339,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
if (bio_bytes == bio->bi_size)
req->bio = bio->bi_next;

- req_bio_endio(req, bio, bio_bytes, error);
+ req_bio_endio(req, bio, bio_bytes, error, batch);

total_bytes += bio_bytes;
nr_bytes -= bio_bytes;
@@ -2390,14 +2392,15 @@ EXPORT_SYMBOL_GPL(blk_update_request);

static bool blk_update_bidi_request(struct request *rq, int error,
unsigned int nr_bytes,
- unsigned int bidi_bytes)
+ unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_request(rq, error, nr_bytes))
+ if (blk_update_request(rq, error, nr_bytes, batch))
return true;

/* Bidi request must be completed as a whole */
if (unlikely(blk_bidi_rq(rq)) &&
- blk_update_request(rq->next_rq, error, bidi_bytes))
+ blk_update_request(rq->next_rq, error, bidi_bytes, batch))
return true;

if (blk_queue_add_random(rq->q))
@@ -2480,7 +2483,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
struct request_queue *q = rq->q;
unsigned long flags;

- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, NULL))
return true;

spin_lock_irqsave(q->queue_lock, flags);
@@ -2506,9 +2509,11 @@ static bool blk_end_bidi_request(struct request *rq, int error,
* %true - still buffers pending for this request
**/
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes)
+ unsigned int nr_bytes,
+ unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, batch))
return true;

blk_finish_request(rq, error);
@@ -2609,7 +2614,7 @@ EXPORT_SYMBOL_GPL(blk_end_request_err);
**/
bool __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
{
- return __blk_end_bidi_request(rq, error, nr_bytes, 0);
+ return __blk_end_bidi_request(rq, error, nr_bytes, 0, NULL);
}
EXPORT_SYMBOL(__blk_end_request);

@@ -2621,7 +2626,8 @@ EXPORT_SYMBOL(__blk_end_request);
* Description:
* Completely finish @rq. Must be called with queue lock held.
*/
-void __blk_end_request_all(struct request *rq, int error)
+void blk_end_request_all_batch(struct request *rq, int error,
+ struct batch_complete *batch)
{
bool pending;
unsigned int bidi_bytes = 0;
@@ -2629,10 +2635,11 @@ void __blk_end_request_all(struct request *rq, int error)
if (unlikely(blk_bidi_rq(rq)))
bidi_bytes = blk_rq_bytes(rq->next_rq);

- pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
+ pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq),
+ bidi_bytes, batch);
BUG_ON(pending);
}
-EXPORT_SYMBOL(__blk_end_request_all);
+EXPORT_SYMBOL(blk_end_request_all_batch);

/**
* __blk_end_request_cur - Helper function to finish the current request chunk.
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 762cfca..ab0ed23 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -316,7 +316,7 @@ void blk_insert_flush(struct request *rq)
* complete the request.
*/
if (!policy) {
- __blk_end_bidi_request(rq, 0, 0, 0);
+ __blk_end_bidi_request(rq, 0, 0, 0, NULL);
return;
}

diff --git a/block/blk.h b/block/blk.h
index e837b8f..dc8fee6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -31,7 +31,8 @@ void blk_queue_bypass_end(struct request_queue *q);
void blk_dequeue_request(struct request *rq);
void __blk_queue_free_tags(struct request_queue *q);
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes);
+ unsigned int nr_bytes, unsigned int bidi_bytes,
+ struct batch_complete *batch);

void blk_rq_timed_out_timer(unsigned long data);
void blk_delete_timer(struct request *);
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 20e061c..9282e66 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -775,7 +775,7 @@ static irqreturn_t swim3_interrupt(int irq, void *dev_id)
if (intr & ERROR_INTR) {
n = fs->scount - 1 - resid / 512;
if (n > 0) {
- blk_update_request(req, 0, n << 9);
+ blk_update_request(req, 0, n << 9, NULL);
fs->req_sector += n;
}
if (fs->retries < 5) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9101124..2901060 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -696,7 +696,7 @@ static void end_clone_bio(struct bio *clone, int error,
* Do not use blk_end_request() here, because it may complete
* the original request before the clone, and break the ordering.
*/
- blk_update_request(tio->orig, 0, nr_bytes);
+ blk_update_request(tio->orig, 0, nr_bytes, NULL);
}

/*
diff --git a/fs/aio.c b/fs/aio.c
index a127e5a..aa39194 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -25,6 +25,7 @@
#include <linux/file.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/bio.h>
#include <linux/mmu_context.h>
#include <linux/percpu.h>
#include <linux/slab.h>
@@ -659,55 +660,11 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}

-/* aio_complete
- * Called when the io request on the given iocb is complete.
- */
-void aio_complete(struct kiocb *iocb, long res, long res2)
+static inline unsigned kioctx_ring_put(struct kioctx *ctx, struct kiocb *req,
+ unsigned tail)
{
- struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring *ring;
struct io_event *ev_page, *event;
- unsigned long flags;
- unsigned tail, pos;
-
- /*
- * Special case handling for sync iocbs:
- * - events go directly into the iocb for fast handling
- * - the sync task with the iocb in its stack holds the single iocb
- * ref, no other paths have a way to get another ref
- * - the sync task helpfully left a reference to itself in the iocb
- */
- if (is_sync_kiocb(iocb)) {
- iocb->ki_user_data = res;
- smp_wmb();
- iocb->ki_ctx = ERR_PTR(-EXDEV);
- wake_up_process(iocb->ki_obj.tsk);
- return;
- }
-
- /*
- * Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after incrementing reqs_available.
- */
- rcu_read_lock();
-
- if (iocb->ki_list.next) {
- unsigned long flags;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_del(&iocb->ki_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- }
-
- /*
- * Add a completion event to the ring buffer. Must be done holding
- * ctx->ctx_lock to prevent other code from messing with the tail
- * pointer since we might be called from irq context.
- */
- spin_lock_irqsave(&ctx->completion_lock, flags);
-
- tail = ctx->tail;
- pos = tail + AIO_EVENTS_OFFSET;
+ unsigned pos = tail + AIO_EVENTS_OFFSET;

if (++tail >= ctx->nr_events)
tail = 0;
@@ -715,22 +672,30 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

- event->obj = (u64)(unsigned long)iocb->ki_obj.user;
- event->data = iocb->ki_user_data;
- event->res = res;
- event->res2 = res2;
+ event->obj = (u64)(unsigned long)req->ki_obj.user;
+ event->data = req->ki_user_data;
+ event->res = req->ki_res;
+ event->res2 = req->ki_res2;

kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
- ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
- res, res2);
+ ctx, tail, req, req->ki_obj.user, req->ki_user_data,
+ req->ki_res, req->ki_res2);

- /* after flagging the request as done, we
- * must never even look at it again
- */
- smp_wmb(); /* make event visible before updating tail */
+ return tail;
+}
+
+static inline void kioctx_ring_unlock(struct kioctx *ctx, unsigned tail)
+{
+ struct aio_ring *ring;
+
+ if (!ctx)
+ return;
+
+ smp_wmb();
+ /* make event visible before updating tail */

ctx->tail = tail;

@@ -739,20 +704,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- spin_unlock_irqrestore(&ctx->completion_lock, flags);
-
- pr_debug("added to ring %p at [%u]\n", iocb, tail);
-
- /*
- * Check if the user asked us to deliver the result through an
- * eventfd. The eventfd_signal() function is safe to be called
- * from IRQ context.
- */
- if (iocb->ki_eventfd != NULL)
- eventfd_signal(iocb->ki_eventfd, 1);
-
- /* everything turned out well, dispose of the aiocb. */
- kiocb_free(iocb);
+ spin_unlock(&ctx->completion_lock);

/*
* We have to order our ring_info tail store above and test
@@ -762,12 +714,108 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
smp_mb();

- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (waitqueue_active(&ctx->wait)) {
+ /* Irqs are already disabled */
+ spin_lock(&ctx->wait.lock);
+ wake_up_locked(&ctx->wait);
+ spin_unlock(&ctx->wait.lock);
+ }
+}
+
+void batch_complete_aio(struct batch_complete *batch)
+{
+ struct kioctx *ctx = NULL;
+ struct kiocb *req, *next;
+ unsigned long flags;
+ unsigned tail = 0;
+
+ /*
+ * Take rcu_read_lock() in case the kioctx is being destroyed, as we
+ * need to issue a wakeup after incrementing reqs_available.
+ */
+ rcu_read_lock();
+ local_irq_save(flags);
+
+ for (req = batch->kiocb; req; req = req->ki_next) {
+ if (req->ki_ctx != ctx) {
+ kioctx_ring_unlock(ctx, tail);

+ ctx = req->ki_ctx;
+ spin_lock(&ctx->completion_lock);
+ tail = ctx->tail;
+ }
+
+ tail = kioctx_ring_put(ctx, req, tail);
+ }
+
+ kioctx_ring_unlock(ctx, tail);
+ local_irq_restore(flags);
rcu_read_unlock();
+
+ for (req = batch->kiocb; req; req = next) {
+ next = req->ki_next;
+
+ if (req->ki_eventfd)
+ eventfd_signal(req->ki_eventfd, 1);
+
+ kiocb_free(req);
+ }
+}
+EXPORT_SYMBOL(batch_complete_aio);
+
+/* aio_complete_batch
+ * Called when the io request on the given iocb is complete; @batch may be
+ * NULL.
+ */
+void aio_complete_batch(struct kiocb *req, long res, long res2,
+ struct batch_complete *batch)
+{
+ req->ki_res = res;
+ req->ki_res2 = res2;
+
+ if (req->ki_list.next) {
+ struct kioctx *ctx = req->ki_ctx;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ list_del(&req->ki_list);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ }
+
+ /*
+ * Special case handling for sync iocbs:
+ * - events go directly into the iocb for fast handling
+ * - the sync task with the iocb in its stack holds the single iocb
+ * ref, no other paths have a way to get another ref
+ * - the sync task helpfully left a reference to itself in the iocb
+ */
+ if (is_sync_kiocb(req)) {
+ req->ki_user_data = req->ki_res;
+ smp_wmb();
+ req->ki_ctx = ERR_PTR(-EXDEV);
+ wake_up_process(req->ki_obj.tsk);
+ } else if (batch) {
+ unsigned i = 0;
+ struct kiocb **p = &batch->kiocb;
+
+ while (*p && (*p)->ki_ctx > req->ki_ctx) {
+ p = &(*p)->ki_next;
+ if (++i == 16) {
+ batch_complete_aio(batch);
+ batch->kiocb = req;
+ return;
+ }
+ }
+
+ req->ki_next = *p;
+ *p = req;
+ } else {
+ struct batch_complete batch_stack = { .kiocb = req };
+
+ batch_complete_aio(&batch_stack);
+ }
}
-EXPORT_SYMBOL(aio_complete);
+EXPORT_SYMBOL(aio_complete_batch);

/* aio_read_events
* Pull an event off of the ioctx's event ring. Returns the number of
diff --git a/fs/bio.c b/fs/bio.c
index e082907..8489d7a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -28,6 +28,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/cgroup.h>
+#include <linux/aio.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include <trace/events/block.h>
@@ -1688,31 +1689,41 @@ void bio_flush_dcache_pages(struct bio *bi)
EXPORT_SYMBOL(bio_flush_dcache_pages);
#endif

-/**
- * bio_endio - end I/O on a bio
- * @bio: bio
- * @error: error, if any
- *
- * Description:
- * bio_endio() will end I/O on the whole bio. bio_endio() is the
- * preferred way to end I/O on a bio, it takes care of clearing
- * BIO_UPTODATE on error. @error is 0 on success, and and one of the
- * established -Exxxx (-EIO, for instance) error values in case
- * something went wrong. No one should call bi_end_io() directly on a
- * bio unless they own it and thus know that it has an end_io
- * function.
- **/
-void bio_endio(struct bio *bio, int error)
+static inline void __bio_endio(struct bio *bio, struct batch_complete *batch)
{
- if (error)
+ if (bio->bi_error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
- error = -EIO;
+ bio->bi_error = -EIO;

if (bio->bi_end_io)
- bio->bi_end_io(bio, error, NULL);
+ bio->bi_end_io(bio, bio->bi_error, batch);
+}
+
+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch)
+{
+ if (error)
+ bio->bi_error = error;
+
+ if (batch)
+ bio_list_add(&batch->bio, bio);
+ else
+ __bio_endio(bio, batch);
+
+}
+EXPORT_SYMBOL(bio_endio_batch);
+
+void batch_complete(struct batch_complete *batch)
+{
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&batch->bio)))
+ __bio_endio(bio, batch);
+
+ if (batch->kiocb)
+ batch_complete_aio(batch);
}
-EXPORT_SYMBOL(bio_endio);
+EXPORT_SYMBOL(batch_complete);

void bio_pair_release(struct bio_pair *bp)
{
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 331fd5c..b4dd97c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -230,7 +230,8 @@ static inline struct page *dio_get_page(struct dio *dio,
* filesystems can use it to hold additional state between get_block calls and
* dio_complete.
*/
-static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
+static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
+ bool is_async, struct batch_complete *batch)
{
ssize_t transferred = 0;

@@ -264,7 +265,7 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
} else {
inode_dio_done(dio->inode);
if (is_async)
- aio_complete(dio->iocb, ret, 0);
+ aio_complete_batch(dio->iocb, ret, 0, batch);
}

return ret;
@@ -274,7 +275,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
*/
-static void dio_bio_end_aio(struct bio *bio, int error)
+static void dio_bio_end_aio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;
unsigned long remaining;
@@ -290,7 +292,7 @@ static void dio_bio_end_aio(struct bio *bio, int error)
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
- dio_complete(dio, dio->iocb->ki_pos, 0, true);
+ dio_complete(dio, dio->iocb->ki_pos, 0, true, batch);
kmem_cache_free(dio_cache, dio);
}
}
@@ -1265,7 +1267,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
dio_await_completion(dio);

if (drop_refcount(dio) == 0) {
- retval = dio_complete(dio, offset, retval, false);
+ retval = dio_complete(dio, offset, retval, false, NULL);
kmem_cache_free(dio_cache, dio);
} else
BUG_ON(retval != -EIOCBQUEUED);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..a6fe048 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -6,11 +6,12 @@
#include <linux/aio_abi.h>
#include <linux/uio.h>
#include <linux/rcupdate.h>
-
#include <linux/atomic.h>
+#include <linux/batch_complete.h>

struct kioctx;
struct kiocb;
+struct batch_complete;

#define KIOCB_KEY 0

@@ -30,6 +31,8 @@ struct kiocb;
typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
+ struct kiocb *ki_next; /* batch completion */
+
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
@@ -41,6 +44,9 @@ struct kiocb {
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
+ long ki_res;
+ long ki_res2;
+
loff_t ki_pos;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */

@@ -71,7 +77,9 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern void aio_complete(struct kiocb *iocb, long res, long res2);
+extern void batch_complete_aio(struct batch_complete *batch);
+extern void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
@@ -79,7 +87,12 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
+static inline void batch_complete_aio(struct batch_complete *batch) { }
+static inline void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch)
+{
+ return;
+}
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -89,6 +102,11 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
kiocb_cancel_fn *cancel) { }
#endif /* CONFIG_AIO */

+static inline void aio_complete(struct kiocb *iocb, long res, long res2)
+{
+ aio_complete_batch(iocb, res, res2, NULL);
+}
+
static inline struct kiocb *list_kiocb(struct list_head *h)
{
return list_entry(h, struct kiocb, ki_list);
diff --git a/include/linux/batch_complete.h b/include/linux/batch_complete.h
new file mode 100644
index 0000000..298baeb
--- /dev/null
+++ b/include/linux/batch_complete.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_BATCH_COMPLETE_H
+#define _LINUX_BATCH_COMPLETE_H
+
+/*
+ * Common stuff to the aio and block code for batch completion. Everything
+ * important is elsewhere:
+ */
+
+struct bio;
+struct kiocb;
+
+struct bio_list {
+ struct bio *head;
+ struct bio *tail;
+};
+
+struct batch_complete {
+ struct bio_list bio;
+ struct kiocb *kiocb;
+};
+
+#endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7f3089f..1c72bfa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -24,6 +24,7 @@
#include <linux/mempool.h>
#include <linux/ioprio.h>
#include <linux/bug.h>
+#include <linux/batch_complete.h>

#ifdef CONFIG_BLOCK

@@ -69,6 +70,8 @@
#define bio_sectors(bio) ((bio)->bi_size >> 9)
#define bio_end_sector(bio) ((bio)->bi_sector + bio_sectors((bio)))

+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch);
+
static inline unsigned int bio_cur_bytes(struct bio *bio)
{
if (bio->bi_vcnt)
@@ -252,7 +255,25 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)

}

-extern void bio_endio(struct bio *, int);
+/**
+ * bio_endio - end I/O on a bio
+ * @bio: bio
+ * @error: error, if any
+ *
+ * Description:
+ * bio_endio() will end I/O on the whole bio. bio_endio() is the
+ * preferred way to end I/O on a bio, it takes care of clearing
+ * BIO_UPTODATE on error. @error is 0 on success, and and one of the
+ * established -Exxxx (-EIO, for instance) error values in case
+ * something went wrong. No one should call bi_end_io() directly on a
+ * bio unless they own it and thus know that it has an end_io
+ * function.
+ **/
+static inline void bio_endio(struct bio *bio, int error)
+{
+ bio_endio_batch(bio, error, NULL);
+}
+
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);

@@ -404,10 +425,6 @@ static inline bool bio_mergeable(struct bio *bio)
* member of the bio. The bio_list also caches the last list member to allow
* fast access to the tail.
*/
-struct bio_list {
- struct bio *head;
- struct bio *tail;
-};

static inline int bio_list_empty(const struct bio_list *bl)
{
@@ -554,6 +571,15 @@ struct biovec_slab {
*/
#define BIO_SPLIT_ENTRIES 2

+static inline void batch_complete_init(struct batch_complete *batch)
+{
+ bio_list_init(&batch->bio);
+ batch->kiocb = NULL;
+}
+
+void batch_complete(struct batch_complete *batch);
+
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b3195e3..9d3cafa 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -43,6 +43,7 @@ struct bio {
* top bits priority
*/

+ short bi_error;
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2fdb4a4..ddc2f80 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -883,7 +883,8 @@ extern struct request *blk_fetch_request(struct request_queue *q);
* This prevents code duplication in drivers.
*/
extern bool blk_update_request(struct request *rq, int error,
- unsigned int nr_bytes);
+ unsigned int nr_bytes,
+ struct batch_complete *batch);
extern bool blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
extern void blk_end_request_all(struct request *rq, int error);
@@ -891,10 +892,17 @@ extern bool blk_end_request_cur(struct request *rq, int error);
extern bool blk_end_request_err(struct request *rq, int error);
extern bool __blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
-extern void __blk_end_request_all(struct request *rq, int error);
extern bool __blk_end_request_cur(struct request *rq, int error);
extern bool __blk_end_request_err(struct request *rq, int error);

+extern void blk_end_request_all_batch(struct request *rq, int error,
+ struct batch_complete *batch);
+
+static inline void __blk_end_request_all(struct request *rq, int error)
+{
+ blk_end_request_all_batch(rq, error, NULL);
+}
+
extern void blk_complete_request(struct request *);
extern void __blk_complete_request(struct request *);
extern void blk_abort_request(struct request *);
--
1.8.2.1

2013-05-14 01:22:04

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 15/21] virtio-blk: convert to batch completion

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/virtio_blk.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 6472395..49d0ec2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -217,7 +217,8 @@ static void virtblk_bio_send_flush_work(struct work_struct *work)
virtblk_bio_send_flush(vbr);
}

-static inline void virtblk_request_done(struct virtblk_req *vbr)
+static inline void virtblk_request_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;
struct request *req = vbr->req;
@@ -231,11 +232,12 @@ static inline void virtblk_request_done(struct virtblk_req *vbr)
req->errors = (error != 0);
}

- __blk_end_request_all(req, error);
+ blk_end_request_all_batch(req, error, batch);
mempool_free(vbr, vblk->pool);
}

-static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_flush_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -244,12 +246,13 @@ static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_data_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_data_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -259,17 +262,18 @@ static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_flush_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
if (unlikely(vbr->flags & VBLK_IS_FLUSH))
- virtblk_bio_flush_done(vbr);
+ virtblk_bio_flush_done(vbr, batch);
else
- virtblk_bio_data_done(vbr);
+ virtblk_bio_data_done(vbr, batch);
}

static void virtblk_done(struct virtqueue *vq)
@@ -279,16 +283,19 @@ static void virtblk_done(struct virtqueue *vq)
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
do {
virtqueue_disable_cb(vq);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
if (vbr->bio) {
- virtblk_bio_done(vbr);
+ virtblk_bio_done(vbr, &batch);
bio_done = true;
} else {
- virtblk_request_done(vbr);
+ virtblk_request_done(vbr, &batch);
req_done = true;
}
}
@@ -298,6 +305,8 @@ static void virtblk_done(struct virtqueue *vq)
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);

+ batch_complete(&batch);
+
if (bio_done)
wake_up(&vblk->queue_wait);
}
--
1.8.2.1

2013-05-14 01:23:21

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 11/21] aio: Kill ki_dtor

sock_aio_dtor() is dead code - and anything that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 2 --
include/linux/aio.h | 1 -
net/socket.c | 13 ++-----------
3 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 40781ff..7ce3cd8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -619,8 +619,6 @@ static void kiocb_free(struct kiocb *req)
fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
- if (req->ki_dtor)
- req->ki_dtor(req);
kmem_cache_free(kiocb_cachep, req);
}

diff --git a/include/linux/aio.h b/include/linux/aio.h
index c4f07ff..d9c92da 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -33,7 +33,6 @@ struct kiocb {
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
- void (*ki_dtor)(struct kiocb *);
void *private;

union {
diff --git a/net/socket.c b/net/socket.c
index bfe9fab..fc3bf4c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -848,11 +848,6 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(kernel_recvmsg);

-static void sock_aio_dtor(struct kiocb *iocb)
-{
- kfree(iocb->private);
-}
-
static ssize_t sock_sendpage(struct file *file, struct page *page,
int offset, size_t size, loff_t *ppos, int more)
{
@@ -883,12 +878,8 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
struct sock_iocb *siocb)
{
- if (!is_sync_kiocb(iocb)) {
- siocb = kmalloc(sizeof(*siocb), GFP_KERNEL);
- if (!siocb)
- return NULL;
- iocb->ki_dtor = sock_aio_dtor;
- }
+ if (!is_sync_kiocb(iocb))
+ BUG();

siocb->kiocb = iocb;
iocb->private = siocb;
--
1.8.2.1

2013-05-14 01:23:40

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 10/21] aio: Kill ki_users

The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.
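
With the refcount gone, the contract is simply that ->ki_cancel() runs
entirely under the ioctx lock. A minimal sketch of what a caller of
kiocb_cancel() looks like after this change (error handling omitted):

    spin_lock_irq(&ctx->ctx_lock);
    ret = kiocb_cancel(ctx, kiocb); /* ki_cancel() runs with ctx_lock
                                     * held, so it must not block */
    spin_unlock_irq(&ctx->ctx_lock);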

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 47 ++++++++++++-----------------------------------
include/linux/aio.h | 5 -----
2 files changed, 12 insertions(+), 40 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 280b014..40781ff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -262,7 +262,6 @@ EXPORT_SYMBOL(kiocb_set_cancel_fn);
static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
{
kiocb_cancel_fn *old, *cancel;
- int ret = -EINVAL;

/*
* Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
@@ -272,21 +271,13 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
cancel = ACCESS_ONCE(kiocb->ki_cancel);
do {
if (!cancel || cancel == KIOCB_CANCELLED)
- return ret;
+ return -EINVAL;

old = cancel;
cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
} while (cancel != old);

- atomic_inc(&kiocb->ki_users);
- spin_unlock_irq(&ctx->ctx_lock);
-
- ret = cancel(kiocb);
-
- spin_lock_irq(&ctx->ctx_lock);
- aio_put_req(kiocb);
-
- return ret;
+ return cancel(kiocb);
}

static void free_ioctx_rcu(struct rcu_head *head)
@@ -510,16 +501,16 @@ static void kill_ioctx(struct kioctx *ctx)
/* wait_on_sync_kiocb:
* Waits on the given sync kiocb to complete.
*/
-ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
+ssize_t wait_on_sync_kiocb(struct kiocb *req)
{
- while (atomic_read(&iocb->ki_users)) {
+ while (!req->ki_ctx) {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (!atomic_read(&iocb->ki_users))
+ if (req->ki_ctx)
break;
io_schedule();
}
__set_current_state(TASK_RUNNING);
- return iocb->ki_user_data;
+ return req->ki_user_data;
}
EXPORT_SYMBOL(wait_on_sync_kiocb);

@@ -601,14 +592,8 @@ out:
}

/* aio_get_req
- * Allocate a slot for an aio request. Increments the ki_users count
- * of the kioctx so that the kioctx stays around until all requests are
- * complete. Returns NULL if no requests are free.
- *
- * Returns with kiocb->ki_users set to 2. The io submit code path holds
- * an extra reference while submitting the i/o.
- * This prevents races between the aio code path referencing the
- * req (after submitting it) and aio_complete() freeing the req.
+ * Allocate a slot for an aio request.
+ * Returns NULL if no requests are free.
*/
static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
@@ -621,7 +606,6 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;

- atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
return req;
out_put:
@@ -640,13 +624,6 @@ static void kiocb_free(struct kiocb *req)
kmem_cache_free(kiocb_cachep, req);
}

-void aio_put_req(struct kiocb *req)
-{
- if (atomic_dec_and_test(&req->ki_users))
- kiocb_free(req);
-}
-EXPORT_SYMBOL(aio_put_req);
-
static struct kioctx *lookup_ioctx(unsigned long ctx_id)
{
struct mm_struct *mm = current->mm;
@@ -685,9 +662,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
* - the sync task helpfully left a reference to itself in the iocb
*/
if (is_sync_kiocb(iocb)) {
- BUG_ON(atomic_read(&iocb->ki_users) != 1);
iocb->ki_user_data = res;
- atomic_set(&iocb->ki_users, 0);
+ smp_wmb();
+ iocb->ki_ctx = ERR_PTR(-EXDEV);
wake_up_process(iocb->ki_obj.tsk);
return;
}
@@ -759,7 +736,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
eventfd_signal(iocb->ki_eventfd, 1);

/* everything turned out well, dispose of the aiocb. */
- aio_put_req(iocb);
+ kiocb_free(iocb);

/*
* We have to order our ring_info tail store above and test
@@ -1183,7 +1160,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;
out_put_req:
put_reqs_available(ctx, 1);
- aio_put_req(req);
+ kiocb_free(req);
return ret;
}

diff --git a/include/linux/aio.h b/include/linux/aio.h
index b570472..c4f07ff 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -30,8 +30,6 @@ struct kiocb;
typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
- atomic_t ki_users;
-
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
@@ -65,7 +63,6 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
- .ki_users = ATOMIC_INIT(1),
.ki_ctx = NULL,
.ki_filp = filp,
.ki_obj.tsk = current,
@@ -75,7 +72,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern void aio_put_req(struct kiocb *iocb);
extern void aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
@@ -84,7 +80,6 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline void aio_put_req(struct kiocb *iocb) { }
static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
--
1.8.2.1

2013-05-14 01:19:17

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 07/21] aio: Don't use ctx->tail unnecessarily

aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.
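
For reference, the wrapped-ring availability calculation the read side
now uses (the same expression as in aio_read_events_ring() below), with
hypothetical numbers:

    avail = (head <= tail ? tail : ctx->nr_events) - head;

    /* e.g. nr_events = 128, head = 120, tail = 8:
     *   1st pass: avail = 128 - 120 = 8, head wraps to 0
     *   2nd pass: avail = 8 - 0 = 8, head == tail, done
     */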

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/aio.c | 41 +++++++++++++++++++++++------------------
1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5e1b801..2c9a5ac 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -306,7 +306,8 @@ static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
struct kiocb *req;
- unsigned cpu, head, avail;
+ unsigned cpu, avail;
+ DEFINE_WAIT(wait);

spin_lock_irq(&ctx->ctx_lock);

@@ -327,21 +328,24 @@ static void free_ioctx(struct kioctx *ctx)
kcpu->reqs_available = 0;
}

- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- kunmap_atomic(ring);
+ while (1) {
+ prepare_to_wait(&ctx->wait, &wait, TASK_UNINTERRUPTIBLE);

- while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
- wait_event(ctx->wait,
- (head != ctx->tail) ||
- (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));
-
- avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;
+ ring = kmap_atomic(ctx->ring_pages[0]);
+ avail = (ring->head <= ring->tail)
+ ? ring->tail - ring->head
+ : ctx->nr_events - ring->head + ring->tail;

atomic_add(avail, &ctx->reqs_available);
- head += avail;
- head %= ctx->nr_events;
+ ring->head = ring->tail;
+ kunmap_atomic(ring);
+
+ if (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1)
+ break;
+
+ schedule();
}
+ finish_wait(&ctx->wait, &wait);

WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

@@ -782,7 +786,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event __user *event, long nr)
{
struct aio_ring *ring;
- unsigned head, pos;
+ unsigned head, tail, pos;
long ret = 0;
int copy_ret;

@@ -790,11 +794,12 @@ static long aio_read_events_ring(struct kioctx *ctx,

ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
+ tail = ring->tail;
kunmap_atomic(ring);

- pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+ pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);

- if (head == ctx->tail)
+ if (head == tail)
goto out;

while (ret < nr) {
@@ -802,8 +807,8 @@ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event *ev;
struct page *page;

- avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;
- if (head == ctx->tail)
+ avail = (head <= tail ? tail : ctx->nr_events) - head;
+ if (head == tail)
break;

avail = min(avail, nr - ret);
@@ -834,7 +839,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- pr_debug("%li h%u t%u\n", ret, head, ctx->tail);
+ pr_debug("%li h%u t%u\n", ret, head, tail);

put_reqs_available(ctx, ret);
out:
--
1.8.2.1

2013-05-14 01:24:07

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 09/21] aio: Kill unneeded kiocb members

The old aio retry infrastructure needed to save the various arguments to
aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 69 +++++++++++++++++++++++++++++++----------------------
include/linux/aio.h | 11 ++-------
2 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 73ec062..280b014 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -637,8 +637,6 @@ static void kiocb_free(struct kiocb *req)
eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
- if (req->ki_iovec != &req->ki_inline_vec)
- kfree(req->ki_iovec);
kmem_cache_free(kiocb_cachep, req);
}

@@ -968,24 +966,26 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);

-static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
+static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb,
+ int rw, char __user *buf,
+ unsigned long *nr_segs,
+ struct iovec **iovec,
+ bool compat)
{
ssize_t ret;

- kiocb->ki_nr_segs = kiocb->ki_nbytes;
+ *nr_segs = kiocb->ki_nbytes;

#ifdef CONFIG_COMPAT
if (compat)
ret = compat_rw_copy_check_uvector(rw,
- (struct compat_iovec __user *)kiocb->ki_buf,
- kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
- &kiocb->ki_iovec);
+ (struct compat_iovec __user *)buf,
+ *nr_segs, 1, *iovec, iovec);
else
#endif
ret = rw_copy_check_uvector(rw,
- (struct iovec __user *)kiocb->ki_buf,
- kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
- &kiocb->ki_iovec);
+ (struct iovec __user *)buf,
+ *nr_segs, 1, *iovec, iovec);
if (ret < 0)
return ret;

@@ -994,15 +994,17 @@ static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
return 0;
}

-static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
+static ssize_t aio_setup_single_vector(struct kiocb *kiocb,
+ int rw, char __user *buf,
+ unsigned long *nr_segs,
+ struct iovec *iovec)
{
- if (unlikely(!access_ok(!rw, kiocb->ki_buf, kiocb->ki_nbytes)))
+ if (unlikely(!access_ok(!rw, buf, kiocb->ki_nbytes)))
return -EFAULT;

- kiocb->ki_iovec = &kiocb->ki_inline_vec;
- kiocb->ki_iovec->iov_base = kiocb->ki_buf;
- kiocb->ki_iovec->iov_len = kiocb->ki_nbytes;
- kiocb->ki_nr_segs = 1;
+ iovec->iov_base = buf;
+ iovec->iov_len = kiocb->ki_nbytes;
+ *nr_segs = 1;
return 0;
}

@@ -1011,15 +1013,18 @@ static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
* Performs the initial checks and aio retry method
* setup for the kiocb at the time of io submission.
*/
-static ssize_t aio_run_iocb(struct kiocb *req, bool compat)
+static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
+ char __user *buf, bool compat)
{
struct file *file = req->ki_filp;
ssize_t ret;
+ unsigned long nr_segs;
int rw;
fmode_t mode;
aio_rw_op *rw_op;
+ struct iovec inline_vec, *iovec = &inline_vec;

- switch (req->ki_opcode) {
+ switch (opcode) {
case IOCB_CMD_PREAD:
case IOCB_CMD_PREADV:
mode = FMODE_READ;
@@ -1040,16 +1045,21 @@ rw_common:
if (!rw_op)
return -EINVAL;

- ret = (req->ki_opcode == IOCB_CMD_PREADV ||
- req->ki_opcode == IOCB_CMD_PWRITEV)
- ? aio_setup_vectored_rw(rw, req, compat)
- : aio_setup_single_vector(rw, req);
+ ret = (opcode == IOCB_CMD_PREADV ||
+ opcode == IOCB_CMD_PWRITEV)
+ ? aio_setup_vectored_rw(req, rw, buf, &nr_segs,
+ &iovec, compat)
+ : aio_setup_single_vector(req, rw, buf, &nr_segs,
+ iovec);
if (ret)
return ret;

ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
- if (ret < 0)
+ if (ret < 0) {
+ if (iovec != &inline_vec)
+ kfree(iovec);
return ret;
+ }

req->ki_nbytes = ret;

@@ -1063,8 +1073,7 @@ rw_common:
if (rw == WRITE)
file_start_write(file);

- ret = rw_op(req, req->ki_iovec,
- req->ki_nr_segs, req->ki_pos);
+ ret = rw_op(req, iovec, nr_segs, req->ki_pos);

if (rw == WRITE)
file_end_write(file);
@@ -1089,6 +1098,9 @@ rw_common:
return -EINVAL;
}

+ if (iovec != &inline_vec)
+ kfree(iovec);
+
if (ret != -EIOCBQUEUED) {
/*
* There's no easy way to restart the syscall since other AIO's
@@ -1160,12 +1172,11 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_obj.user = user_iocb;
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
-
- req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
req->ki_nbytes = iocb->aio_nbytes;
- req->ki_opcode = iocb->aio_lio_opcode;

- ret = aio_run_iocb(req, compat);
+ ret = aio_run_iocb(req, iocb->aio_lio_opcode,
+ (char __user *)(unsigned long)iocb->aio_buf,
+ compat);
if (ret)
goto out_put_req;

diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7bb766e..b570472 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -36,6 +36,7 @@ struct kiocb {
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
void (*ki_dtor)(struct kiocb *);
+ void *private;

union {
void __user *user;
@@ -44,15 +45,7 @@ struct kiocb {

__u64 ki_user_data; /* user's data for completion */
loff_t ki_pos;
-
- void *private;
- /* State that we remember to be able to restart/retry */
- unsigned short ki_opcode;
- size_t ki_nbytes; /* copy of iocb->aio_nbytes */
- char __user *ki_buf; /* remaining iocb->aio_buf */
- struct iovec ki_inline_vec; /* inline vector */
- struct iovec *ki_iovec;
- unsigned long ki_nr_segs;
+ size_t ki_nbytes; /* copy of iocb->aio_nbytes */

struct list_head ki_list; /* the aio core uses this
* for cancellation */
--
1.8.2.1

2013-05-14 01:19:12

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 02/21] aio: reqs_active -> reqs_available

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.
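
As a rough sketch of that slot model (simplified; ctx and nr_slots stand in
for the real fields, and the actual conversion is in the diff below):

	/* reqs_available counts free completion-ring slots */
	atomic_set(&ctx->reqs_available, nr_slots);	/* at io_setup() */

	/* submitting a kiocb takes a slot... */
	if (atomic_dec_if_positive(&ctx->reqs_available) < 0)
		return -EAGAIN;		/* no free slots - the ring would overflow */

	/* ...and pulling its io_event off the ring gives the slot back */
	atomic_inc(&ctx->reqs_available);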

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index fe794af..bde41c1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -89,7 +89,13 @@ struct kioctx {
struct work_struct rcu_work;

struct {
- atomic_t reqs_active;
+ /*
+ * This counts the number of available slots in the ringbuffer,
+ * so we avoid overflowing it: it's decremented (if positive)
+ * when allocating a kiocb and incremented when the resulting
+ * io_event is pulled off the ringbuffer.
+ */
+ atomic_t reqs_available;
} ____cacheline_aligned_in_smp;

struct {
@@ -306,19 +312,19 @@ static void free_ioctx(struct kioctx *ctx)
head = ring->head;
kunmap_atomic(ring);

- while (atomic_read(&ctx->reqs_active) > 0) {
+ while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
wait_event(ctx->wait,
(head != ctx->tail) ||
- (atomic_read(&ctx->reqs_active) <= 0));
+ (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));

avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

- atomic_sub(avail, &ctx->reqs_active);
+ atomic_add(avail, &ctx->reqs_available);
head += avail;
head %= ctx->nr_events;
}

- WARN_ON(atomic_read(&ctx->reqs_active) < 0);
+ WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

aio_free_ring(ctx);

@@ -382,6 +388,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (aio_setup_ring(ctx) < 0)
goto out_freectx;

+ atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -484,7 +492,7 @@ void exit_aio(struct mm_struct *mm)
"exit_aio:ioctx still alive: %d %d %d\n",
atomic_read(&ctx->users),
atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_active));
+ atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -516,12 +524,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_read(&ctx->reqs_active) >= ctx->nr_events)
+ if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
return NULL;

- if (atomic_inc_return(&ctx->reqs_active) > ctx->nr_events - 1)
- goto out_put;
-
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
goto out_put;
@@ -531,7 +536,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

return req;
out_put:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
return NULL;
}

@@ -602,7 +607,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

/*
* Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after decrementing reqs_active.
+ * need to issue a wakeup after incrementing reqs_available.
*/
rcu_read_lock();

@@ -620,7 +625,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}
@@ -758,7 +763,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_sub(ret, &ctx->reqs_active);
+ atomic_add(ret, &ctx->reqs_available);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1142,7 +1147,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.2.1

2013-05-14 01:24:30

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 06/21] aio: io_cancel() no longer returns the io_event

Originally, io_cancel() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/usb/gadget/inode.c | 3 +--
fs/aio.c | 40 ++++++++++------------------------------
include/linux/aio.h | 2 +-
3 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 570c005..e02c1e0 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -524,7 +524,7 @@ struct kiocb_priv {
unsigned actual;
};

-static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
+static int ep_aio_cancel(struct kiocb *iocb)
{
struct kiocb_priv *priv = iocb->private;
struct ep_data *epdata;
@@ -540,7 +540,6 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
// spin_unlock(&epdata->dev->lock);
local_irq_enable();

- aio_put_req(iocb);
return value;
}

diff --git a/fs/aio.c b/fs/aio.c
index 93383b0..5e1b801 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -259,8 +259,7 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
}
EXPORT_SYMBOL(kiocb_set_cancel_fn);

-static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
- struct io_event *res)
+static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
{
kiocb_cancel_fn *old, *cancel;
int ret = -EINVAL;
@@ -282,12 +281,10 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
atomic_inc(&kiocb->ki_users);
spin_unlock_irq(&ctx->ctx_lock);

- memset(res, 0, sizeof(*res));
- res->obj = (u64)(unsigned long)kiocb->ki_obj.user;
- res->data = kiocb->ki_user_data;
- ret = cancel(kiocb, res);
+ ret = cancel(kiocb);

spin_lock_irq(&ctx->ctx_lock);
+ aio_put_req(kiocb);

return ret;
}
@@ -308,7 +305,6 @@ static void free_ioctx_rcu(struct rcu_head *head)
static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
- struct io_event res;
struct kiocb *req;
unsigned cpu, head, avail;

@@ -319,7 +315,7 @@ static void free_ioctx(struct kioctx *ctx)
struct kiocb, ki_list);

list_del_init(&req->ki_list);
- kiocb_cancel(ctx, req, &res);
+ kiocb_cancel(ctx, req);
}

spin_unlock_irq(&ctx->ctx_lock);
@@ -709,21 +705,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
}

/*
- * cancelled requests don't get events, userland was given one
- * when the event got cancelled.
- */
- if (unlikely(xchg(&iocb->ki_cancel,
- KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- /*
- * Can't use the percpu reqs_available here - could race with
- * free_ioctx()
- */
- atomic_inc(&ctx->reqs_available);
- /* Still need the wake_up in case free_ioctx is waiting */
- goto put_rq;
- }
-
- /*
* Add a completion event to the ring buffer. Must be done holding
* ctx->ctx_lock to prevent other code from messing with the tail
* pointer since we might be called from irq context.
@@ -775,7 +756,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
if (iocb->ki_eventfd != NULL)
eventfd_signal(iocb->ki_eventfd, 1);

-put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);

@@ -1352,7 +1332,6 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,
SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
struct io_event __user *, result)
{
- struct io_event res;
struct kioctx *ctx;
struct kiocb *kiocb;
u32 key;
@@ -1370,18 +1349,19 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,

kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb)
- ret = kiocb_cancel(ctx, kiocb, &res);
+ ret = kiocb_cancel(ctx, kiocb);
else
ret = -EINVAL;

spin_unlock_irq(&ctx->ctx_lock);

if (!ret) {
- /* Cancellation succeeded -- copy the result
- * into the user's buffer.
+ /*
+ * The result argument is no longer used - the io_event is
+ * always delivered via the ring buffer. -EINPROGRESS indicates
+ * cancellation is in progress:
*/
- if (copy_to_user(result, &res, sizeof(res)))
- ret = -EFAULT;
+ ret = -EINPROGRESS;
}

put_ioctx(ctx);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 1bdf965..8c8dd1d 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -27,7 +27,7 @@ struct kiocb;
*/
#define KIOCB_CANCELLED ((void *) (~0ULL))

-typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);
+typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
atomic_t ki_users;
--
1.8.2.1

2013-05-14 01:24:53

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 05/21] aio: percpu ioctx refcount

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c341cee..93383b0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -36,6 +36,7 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
+#include <linux/percpu-refcount.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -65,8 +66,7 @@ struct kioctx_cpu {
};

struct kioctx {
- atomic_t users;
- atomic_t dead;
+ struct percpu_ref users;

/* This needs improving */
unsigned long user_id;
@@ -370,7 +370,7 @@ static void free_ioctx(struct kioctx *ctx)

static void put_ioctx(struct kioctx *ctx)
{
- if (unlikely(atomic_dec_and_test(&ctx->users)))
+ if (percpu_ref_put(&ctx->users))
free_ioctx(ctx);
}

@@ -411,8 +411,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

ctx->max_reqs = nr_events;

- atomic_set(&ctx->users, 2);
- atomic_set(&ctx->dead, 0);
+ if (percpu_ref_init(&ctx->users))
+ goto out_freectx;
+
+ rcu_read_lock();
+ percpu_ref_get(&ctx->users);
+ rcu_read_unlock();
+
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
@@ -422,7 +427,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

ctx->cpu = alloc_percpu(struct kioctx_cpu);
if (!ctx->cpu)
- goto out_freectx;
+ goto out_freeref;

if (aio_setup_ring(ctx) < 0)
goto out_freepcpu;
@@ -455,6 +460,8 @@ out_cleanup:
aio_free_ring(ctx);
out_freepcpu:
free_percpu(ctx->cpu);
+out_freeref:
+ free_percpu(ctx->users.pcpu_count);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -484,7 +491,7 @@ static void kill_ioctx_rcu(struct rcu_head *head)
*/
static void kill_ioctx(struct kioctx *ctx)
{
- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
/* Between hlist_del_rcu() and dropping the initial ref */
synchronize_rcu();
@@ -530,12 +537,6 @@ void exit_aio(struct mm_struct *mm)
struct hlist_node *n;

hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
- if (1 != atomic_read(&ctx->users))
- printk(KERN_DEBUG
- "exit_aio:ioctx still alive: %d %d %d\n",
- atomic_read(&ctx->users),
- atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -546,7 +547,7 @@ void exit_aio(struct mm_struct *mm)
*/
ctx->mmap_size = 0;

- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
}
@@ -657,7 +658,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
if (ctx->user_id == ctx_id) {
- atomic_inc(&ctx->users);
+ percpu_ref_get(&ctx->users);
ret = ctx;
break;
}
@@ -870,7 +871,7 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
if (ret > 0)
*i += ret;

- if (unlikely(atomic_read(&ctx->dead)))
+ if (unlikely(percpu_ref_dead(&ctx->users)))
ret = -EINVAL;

if (!*i)
--
1.8.2.1

2013-05-14 01:19:09

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 01/21] aio: fix kioctx not being freed after cancellation at exit time

From: Benjamin LaHaise <[email protected]>

The recent changes overhauling fs/aio.c introduced a bug that results in the
kioctx not being freed when outstanding kiocbs are cancelled at exit_aio()
time. Specifically, a kiocb that is cancelled has its completion events
discarded by batch_complete_aio(), which then fails to wake up the process
stuck in free_ioctx(). Fix this by modifying the wait_event() condition
in free_ioctx() appropriately.

This patch was tested with the cancel operation in the thread based code
posted yesterday.

Signed-off-by: Benjamin LaHaise <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Josh Boyer <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..fe794af 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -307,7 +307,9 @@ static void free_ioctx(struct kioctx *ctx)
kunmap_atomic(ring);

while (atomic_read(&ctx->reqs_active) > 0) {
- wait_event(ctx->wait, head != ctx->tail);
+ wait_event(ctx->wait,
+ (head != ctx->tail) ||
+ (atomic_read(&ctx->reqs_active) <= 0));

avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

--
1.8.2.1

2013-05-14 01:25:12

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 04/21] Generic percpu refcounting

This implements a refcount with similar semantics to
atomic_inc()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts. Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate barriers
(synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only returns
true once, so callers don't have to reimplement shutdown synchronization.
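
A minimal usage sketch against the API added below (struct foo and its
functions are made-up names; the kioctx conversion later in this series is
the real user):

	#include <linux/percpu-refcount.h>
	#include <linux/slab.h>

	struct foo {
		struct percpu_ref ref;	/* percpu_ref_init() gives it one (initial) ref */
	};

	static void foo_put(struct foo *foo)
	{
		/* only returns true once the count hits 0, after percpu_ref_kill() */
		if (percpu_ref_put(&foo->ref))
			kfree(foo);
	}

	static void foo_destroy(struct foo *foo)
	{
		/* switch back to atomic mode; returns true exactly once */
		if (percpu_ref_kill(&foo->ref))
			foo_put(foo);	/* now it's safe to drop the initial ref */
	}

percpu_ref_get()/foo_put() are the percpu fast paths in between, and
percpu_ref_put_initial_ref() below rolls the kill + final put pair into one
call.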

[[email protected]: fix build]
[[email protected]: coding-style tweak]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/percpu-refcount.h | 118 +++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/percpu-refcount.c | 140 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 259 insertions(+), 1 deletion(-)
create mode 100644 include/linux/percpu-refcount.h
create mode 100644 lib/percpu-refcount.c

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 0000000..5bd35c7
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,118 @@
+/*
+ * Dynamic percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <[email protected]>
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspace calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
+ * the kioctx from the process's list of kioctxs - after that, there can't be
+ * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
+ * the initial ref with percpu_ref_put().
+ *
+ * Code that does a two stage shutdown like this often needs some kind of
+ * explicit synchronization to ensure the initial refcount can only be dropped
+ * once - percpu_ref_kill() does this for you, it returns true once and false if
+ * someone else already called it. The aio code uses it this way, but it's not
+ * necessary if the code has some other mechanism to synchronize teardown.
+ */
+
+#ifndef _LINUX_PERCPU_REFCOUNT_H
+#define _LINUX_PERCPU_REFCOUNT_H
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+struct percpu_ref {
+ atomic_t count;
+ unsigned __percpu *pcpu_count;
+};
+
+int percpu_ref_init(struct percpu_ref *ref);
+int percpu_ref_tryget(struct percpu_ref *ref);
+int percpu_ref_put_initial_ref(struct percpu_ref *ref);
+
+/**
+ * percpu_ref_get - increment a dynamic percpu refcount
+ *
+ * Analogous to atomic_inc().
+ */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ __this_cpu_inc(*pcpu_count);
+ else
+ atomic_inc(&ref->count);
+
+ preempt_enable();
+}
+
+/**
+ * percpu_ref_put - decrement a dynamic percpu refcount
+ *
+ * Returns true if the result is 0, otherwise false; only checks for the ref
+ * hitting 0 after percpu_ref_kill() has been called. Analogous to
+ * atomic_dec_and_test().
+ */
+static inline int percpu_ref_put(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ int ret = 0;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ __this_cpu_dec(*pcpu_count);
+ else
+ ret = atomic_dec_and_test(&ref->count);
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref);
+int percpu_ref_kill(struct percpu_ref *ref);
+
+/**
+ * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
+ *
+ * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
+ */
+static inline int percpu_ref_dead(struct percpu_ref *ref)
+{
+ return ref->pcpu_count == NULL;
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e9c52e1..25a0ce1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o
+ earlycpio.o percpu-refcount.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
new file mode 100644
index 0000000..4a0155b
--- /dev/null
+++ b/lib/percpu-refcount.c
@@ -0,0 +1,140 @@
+#define pr_fmt(fmt) "%s: " fmt "\n", __func__
+
+#include <linux/kernel.h>
+#include <linux/percpu-refcount.h>
+
+/*
+ * The trick to implementing percpu refcounts is shutdown. We can't detect the
+ * ref hitting 0 on every put - this would require global synchronization and
+ * defeat the whole purpose of using percpu refs.
+ *
+ * What we do is require the user to keep track of the initial refcount; we know
+ * the ref can't hit 0 before the user drops the initial ref, so as long as we
+ * convert to non percpu mode before the initial ref is dropped everything
+ * works.
+ *
+ * Converting to non percpu mode is done with some RCUish stuff in
+ * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
+ * can't hit 0 before we've added up all the percpu refs.
+ */
+
+#define PCPU_COUNT_BIAS (1ULL << 31)
+
+int percpu_ref_tryget(struct percpu_ref *ref)
+{
+ int ret = 1;
+
+ preempt_disable();
+
+ if (!percpu_ref_dead(ref))
+ percpu_ref_get(ref);
+ else
+ ret = 0;
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ unsigned count = 0;
+ int cpu;
+
+ preempt_disable();
+
+ count = atomic_read(&ref->count);
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(pcpu_count, cpu);
+
+ preempt_enable();
+
+ return count;
+}
+
+/**
+ * percpu_ref_init - initialize a dynamic percpu refcount
+ *
+ * Initializes the refcount in single atomic counter mode with a refcount of 1;
+ * analogous to atomic_set(ref, 1).
+ */
+int percpu_ref_init(struct percpu_ref *ref)
+{
+ atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
+
+ ref->pcpu_count = alloc_percpu(unsigned);
+ if (!ref->pcpu_count)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/**
+ * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
+ *
+ * Must be called before dropping the initial ref, so that percpu_ref_put()
+ * knows to check for the refcount hitting 0. If the refcount was in percpu
+ * mode, converts it back to single atomic counter mode.
+ *
+ * The caller must issue a synchronize_rcu()/call_rcu() before calling
+ * percpu_ref_put() to drop the initial ref.
+ *
+ * Returns true the first time called on @ref and false if @ref is already
+ * shutting down, so it may be used by the caller for synchronizing other parts
+ * of a two stage shutdown.
+ */
+int percpu_ref_kill(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ unsigned __percpu *old;
+ unsigned count = 0;
+ int cpu;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ do {
+ if (!pcpu_count)
+ return 0;
+
+ old = pcpu_count;
+ pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
+ } while (pcpu_count != old);
+
+ synchronize_sched();
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(pcpu_count, cpu);
+
+ free_percpu(pcpu_count);
+
+ pr_debug("global %lli pcpu %i",
+ (int64_t) atomic_read(&ref->count), (int) count);
+
+ atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
+
+ return 1;
+}
+
+/**
+ * percpu_ref_put_initial_ref - safely drop the initial ref
+ *
+ * A percpu refcount needs a shutdown sequence before dropping the initial ref,
+ * to put it back into single atomic_t mode with the appropriate barriers so
+ * that percpu_ref_put() can safely check for it hitting 0 - this does so.
+ *
+ * Returns true if @ref hit 0.
+ */
+int percpu_ref_put_initial_ref(struct percpu_ref *ref)
+{
+ if (percpu_ref_kill(ref)) {
+ return percpu_ref_put(ref);
+ } else {
+ WARN_ON(1);
+ return 0;
+ }
+}
--
1.8.2.1

2013-05-14 01:25:37

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 03/21] aio: percpu reqs_available

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before; it was already being rounded up to the
next page, so this doesn't change the actual semantics.
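
Illustrative arithmetic (made-up numbers, and ignoring that aio_setup_ring()
rounds the ring up to whole pages): with num_possible_cpus() == 4 and
io_setup(128),

	nr_events = max(128, 4 * 4) = 128
	nr_events *= 2			-> 256 ring slots requested
	req_batch = (256 - 1) / (4 * 4) = 15

so roughly 15 slots move between the global reqs_available counter and a
cpu's local counter at a time.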

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 99 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index bde41c1..c341cee 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -26,6 +26,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/mmu_context.h>
+#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/timer.h>
#include <linux/aio.h>
@@ -59,6 +60,10 @@ struct aio_ring {

#define AIO_RING_PAGES 8

+struct kioctx_cpu {
+ unsigned reqs_available;
+};
+
struct kioctx {
atomic_t users;
atomic_t dead;
@@ -67,6 +72,13 @@ struct kioctx {
unsigned long user_id;
struct hlist_node list;

+ struct __percpu kioctx_cpu *cpu;
+
+ /*
+ * For percpu reqs_available, number of slots we move to/from global
+ * counter at a time:
+ */
+ unsigned req_batch;
/*
* This is what userspace passed to io_setup(), it's not used for
* anything but counting against the global max_reqs quota.
@@ -94,6 +106,8 @@ struct kioctx {
* so we avoid overflowing it: it's decremented (if positive)
* when allocating a kiocb and incremented when the resulting
* io_event is pulled off the ringbuffer.
+ *
+ * We batch accesses to it with a percpu version.
*/
atomic_t reqs_available;
} ____cacheline_aligned_in_smp;
@@ -281,6 +295,8 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
static void free_ioctx_rcu(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+
+ free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
}

@@ -294,7 +310,7 @@ static void free_ioctx(struct kioctx *ctx)
struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
- unsigned head, avail;
+ unsigned cpu, head, avail;

spin_lock_irq(&ctx->ctx_lock);

@@ -308,6 +324,13 @@ static void free_ioctx(struct kioctx *ctx)

spin_unlock_irq(&ctx->ctx_lock);

+ for_each_possible_cpu(cpu) {
+ struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
+
+ atomic_add(kcpu->reqs_available, &ctx->reqs_available);
+ kcpu->reqs_available = 0;
+ }
+
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
@@ -360,6 +383,18 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
struct kioctx *ctx;
int err = -ENOMEM;

+ /*
+ * We keep track of the number of available ringbuffer slots, to prevent
+ * overflow (reqs_available), and we also use percpu counters for this.
+ *
+ * So since up to half the slots might be on other cpu's percpu counters
+ * and unavailable, double nr_events so userspace sees what they
+ * expected: additionally, we move req_batch slots to/from percpu
+ * counters at a time, so make sure that isn't 0:
+ */
+ nr_events = max(nr_events, num_possible_cpus() * 4);
+ nr_events *= 2;
+
/* Prevent overflows */
if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
(nr_events > (0x10000000U / sizeof(struct kiocb)))) {
@@ -385,10 +420,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

INIT_LIST_HEAD(&ctx->active_reqs);

- if (aio_setup_ring(ctx) < 0)
+ ctx->cpu = alloc_percpu(struct kioctx_cpu);
+ if (!ctx->cpu)
goto out_freectx;

+ if (aio_setup_ring(ctx) < 0)
+ goto out_freepcpu;
+
atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+ ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
+ BUG_ON(!ctx->req_batch);

/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
@@ -412,6 +453,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
+out_freepcpu:
+ free_percpu(ctx->cpu);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -510,6 +553,52 @@ void exit_aio(struct mm_struct *mm)
}
}

+static void put_reqs_available(struct kioctx *ctx, unsigned nr)
+{
+ struct kioctx_cpu *kcpu;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ kcpu->reqs_available += nr;
+ while (kcpu->reqs_available >= ctx->req_batch * 2) {
+ kcpu->reqs_available -= ctx->req_batch;
+ atomic_add(ctx->req_batch, &ctx->reqs_available);
+ }
+
+ preempt_enable();
+}
+
+static bool get_reqs_available(struct kioctx *ctx)
+{
+ struct kioctx_cpu *kcpu;
+ bool ret = false;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ if (!kcpu->reqs_available) {
+ int old, avail = atomic_read(&ctx->reqs_available);
+
+ do {
+ if (avail < ctx->req_batch)
+ goto out;
+
+ old = avail;
+ avail = atomic_cmpxchg(&ctx->reqs_available,
+ avail, avail - ctx->req_batch);
+ } while (avail != old);
+
+ kcpu->reqs_available += ctx->req_batch;
+ }
+
+ ret = true;
+ kcpu->reqs_available--;
+out:
+ preempt_enable();
+ return ret;
+}
+
/* aio_get_req
* Allocate a slot for an aio request. Increments the ki_users count
* of the kioctx so that the kioctx stays around until all requests are
@@ -524,7 +613,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
+ if (!get_reqs_available(ctx))
return NULL;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
@@ -533,10 +622,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

atomic_set(&req->ki_users, 2);
req->ki_ctx = ctx;
-
return req;
out_put:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
return NULL;
}

@@ -625,6 +713,10 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+ /*
+ * Can't use the percpu reqs_available here - could race with
+ * free_ioctx()
+ */
atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
@@ -763,7 +855,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_add(ret, &ctx->reqs_available);
+ put_reqs_available(ctx, ret);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1147,7 +1239,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.2.1

2013-05-14 13:52:24

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/13, Kent Overstreet wrote:
>
> +unsigned tag_alloc(struct tag_pool *pool, bool wait)
> +{
> + struct tag_cpu_freelist *tags;
> + unsigned long flags;
> + unsigned ret;
> +retry:
> + preempt_disable();
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
> +
> + while (!tags->nr_free) {
> + spin_lock(&pool->lock);
> +
> + if (pool->nr_free)
> + move_tags(tags->free, &tags->nr_free,
> + pool->free, &pool->nr_free,
> + min(pool->nr_free, pool->watermark));
> + else if (wait) {
> + struct tag_waiter wait = { .task = current };
> +
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + list_add(&wait.list, &pool->wait);
> +
> + spin_unlock(&pool->lock);
> + local_irq_restore(flags);
> + preempt_enable();
> +
> + schedule();
> + __set_current_state(TASK_RUNNING);

schedule() always returns in TASK_RUNNING state

> +
> + if (!list_empty_careful(&wait.list)) {
> + spin_lock_irqsave(&pool->lock, flags);
> + list_del_init(&wait.list);
> + spin_unlock_irqrestore(&pool->lock, flags);

This is only theoretical, but racy.

tag_free() does

list_del_init(wait->list);
/* WINDOW */
wake_up_process(wait->task);

in theory the caller of tag_alloc() can notice list_empty_careful(),
return without taking pool->lock, exit, and free this task_struct.

But the main problem is that it is not clear why this code reimplements
add_wait_queue/wake_up_all, for what?

I must admit, I do not understand what this code actually does ;)
I didn't try to read it carefully though, but perhaps at least the
changelog could explain more?

Oleg.

2013-05-14 13:55:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On 05/13, Kent Overstreet wrote:
>
> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned __percpu *old;
> + unsigned count = 0;
> + int cpu;
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + do {
> + if (!pcpu_count)
> + return 0;
> +
> + old = pcpu_count;
> + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> + } while (pcpu_count != old);

This is purely cosmetic, feel free to ignore. But afaics all we
need is

pcpu_count = ACCESS_ONCE(ref->pcpu_count);
if (!cmpxchg(&ref->pcpu_count, pcpu_count, NULL))
return 0;

Oleg.

2013-05-14 14:28:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/14, Oleg Nesterov wrote:
>
> I must admit, I do not understand what this code actually does ;)
> I didn't try to read it carefully though, but perhaps at least the
> changelog could explain more?

OK, this is clear...

But perhaps the changelog could explain who needs the "fast" version
of, say, find_next_zero_bit + test_and_set_bit ;) Just curious.

Oleg.

2013-05-14 14:59:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hello,

On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> +/**
> + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> + *
> + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.

Explanation on synchronization and use cases would be nice. People
tend to develop massive mis-uses for interfaces like this.

> + */
> +static inline int percpu_ref_dead(struct percpu_ref *ref)
> +{
> + return ref->pcpu_count == NULL;
> +}
...
> +/*
> + * The trick to implementing percpu refcounts is shutdown. We can't detect the
> + * ref hitting 0 on every put - this would require global synchronization and
> + * defeat the whole purpose of using percpu refs.
> + *
> + * What we do is require the user to keep track of the initial refcount; we know
> + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> + * convert to non percpu mode before the initial ref is dropped everything
> + * works.

Can you please also explain why per-cpu wrapping is safe somewhere?

> + * Converting to non percpu mode is done with some RCUish stuff in
> + * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
> + * can't hit 0 before we've added up all the percpu refs.
> + */
> +
> +#define PCPU_COUNT_BIAS (1ULL << 31)

Are we sure this is enough? 1<<31 is a fairly large number but it's
just easy enough to breach from time to time and it's gonna be hellish
to reproduce / debug when it actually overflows. Maybe we want
atomic64_t w/ 1LLU << 63 bias? Or is there something else which
guarantees that the bias can't over/underflow?

> +int percpu_ref_tryget(struct percpu_ref *ref)
> +{
> + int ret = 1;
> +
> + preempt_disable();
> +
> + if (!percpu_ref_dead(ref))
> + percpu_ref_get(ref);
> + else
> + ret = 0;
> +
> + preempt_enable();
> +
> + return ret;
> +}

Why isn't the above one inline?

Why no /** comment on public functions? It'd be great if you can
explicitly warn about the racy nature of the function - especially,
the function may return overflowed or zero refcnt. BTW, why is this
function necessary? What's the use case?

> +unsigned percpu_ref_count(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned count = 0;
> + int cpu;
> +
> + preempt_disable();
> +
> + count = atomic_read(&ref->count);
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (pcpu_count)
> + for_each_possible_cpu(cpu)
> + count += *per_cpu_ptr(pcpu_count, cpu);
> +
> + preempt_enable();
> +
> + return count;
> +}
...
> +/**
> + * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
> + *
> + * Must be called before dropping the initial ref, so that percpu_ref_put()
> + * knows to check for the refcount hitting 0. If the refcount was in percpu
> + * mode, converts it back to single atomic counter mode.
> + *
> + * The caller must issue a synchronize_rcu()/call_rcu() before calling
> + * percpu_ref_put() to drop the initial ref.
> + *
> + * Returns true the first time called on @ref and false if @ref is already
> + * shutting down, so it may be used by the caller for synchronizing other parts
> + * of a two stage shutdown.
> + */

I'm not sure I like this interface. Why does it allow being called
multiple times? Why is that necessary? Wouldn't just making it
return void and trigger WARN_ON() if it detects that it's being called
multiple times better? Also, why not bool if the return value is
true/false?

> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned __percpu *old;
> + unsigned count = 0;
> + int cpu;
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + do {
> + if (!pcpu_count)
> + return 0;
> +
> + old = pcpu_count;
> + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> + } while (pcpu_count != old);
> +
> + synchronize_sched();

And this makes the whole function blocking. Why not use call_rcu() so
that the ref can be killed w/o sleepable context too?

> +
> + for_each_possible_cpu(cpu)
> + count += *per_cpu_ptr(pcpu_count, cpu);
> +
> + free_percpu(pcpu_count);
> +
> + pr_debug("global %lli pcpu %i",
> + (int64_t) atomic_read(&ref->count), (int) count);
> +
> + atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
> +
> + return 1;
> +}
> +
> +/**
> + * percpu_ref_put_initial_ref - safely drop the initial ref
> + *
> + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> + * to put it back into single atomic_t mode with the appropriate barriers so
> + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> + *
> + * Returns true if @ref hit 0.
> + */
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> +{
> + if (percpu_ref_kill(ref)) {
> + return percpu_ref_put(ref);
> + } else {
> + WARN_ON(1);
> + return 0;
> + }
> +}

Can we just roll the above into percpu_ref_kill()? It's much harder
to misuse if kill puts the base ref.

Thanks.

--
tejun

2013-05-14 15:03:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Mon, May 13, 2013 at 06:18:54PM -0700, Kent Overstreet wrote:
> +struct tag_pool {
> + unsigned watermark;
> + unsigned nr_tags;
> +
> + struct tag_cpu_freelist *tag_cpu;
> +
> + struct {
> + /* Global freelist */
> + unsigned nr_free;
> + unsigned *free;
> + spinlock_t lock;
> + struct list_head wait;
> + } ____cacheline_aligned;
> +};

Come on, Kent. No comment at all in the whole posting and no
justification for the patch or explanation of use cases?

--
tejun

2013-05-14 15:32:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On 05/14, Tejun Heo wrote:
>
> > +int percpu_ref_tryget(struct percpu_ref *ref)
> > +{
> > + int ret = 1;
> > +
> > + preempt_disable();
> > +
> > + if (!percpu_ref_dead(ref))
> > + percpu_ref_get(ref);
> > + else
> > + ret = 0;
> > +
> > + preempt_enable();
> > +
> > + return ret;
> > +}
...
> BTW, why is this
> function necessary? What's the use case?

Yes, I was wondering too.

And please note that this code _looks_ wrong, percpu_ref_get() still
can increment ref->count.

Hmm. Just noticed this comment above percpu_ref_kill()

* The caller must issue a synchronize_rcu()/call_rcu() before calling
* percpu_ref_put() to drop the initial ref.

Really?

Oleg.

2013-05-14 21:59:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

A couple more things.

On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
...
> +/**
> + * percpu_ref_put - decrement a dynamic percpu refcount
> + *
> + * Returns true if the result is 0, otherwise false; only checks for the ref
> + * hitting 0 after percpu_ref_kill() has been called. Analagous to
> + * atomic_dec_and_test().
> + */
> +static inline int percpu_ref_put(struct percpu_ref *ref)

bool?

> +{
> + unsigned __percpu *pcpu_count;
> + int ret = 0;
> +
> + preempt_disable();
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (pcpu_count)

We probably want likely() here.

> + __this_cpu_dec(*pcpu_count);
> + else
> + ret = atomic_dec_and_test(&ref->count);
> +
> + preempt_enable();
> +
> + return ret;

With likely() added, I think the compiler should be able to recognize
that the branch on pcpu_count should exclude the later branch in the
caller testing for the final put in most cases, but I'm a bit worried
whether that would always be the case and wonder whether a ->release
based interface would be better. Another concern is that the above
interface is likely to encourage its users to put the release
implementation in the same function. e.g.

void my_put(my_obj)
{
if (!percpu_ref_put(&my_obj->ref))
return;
destroy my_obj;
free my_obj;
}

Which in turn is likely to nudge the developer or compiler towards not
inlining the fast path.

So, while I do like the simplicity of put() returning %true on the
final put, I suspect it's more likely to slow down fast paths due
to its interface compared to having separate ->release function
combined with void put(). Any ideas?

Thanks.

--
tejun

2013-05-14 22:15:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hello, again, continuing the brain diarrhea,

On Tue, May 14, 2013 at 02:59:45PM -0700, Tejun Heo wrote:
> So, while I do like the simplicity of put() returning %true on the
> final put, I suspect it's more likely to slowing down fast paths due
> to its interface compared to having separate ->release function
> combined with void put(). Any ideas?

Maybe we can structure put in a way that's difficult to get wrong for
the compiler?

bool put()
{
preempt_disable();
if (likely(not killed yet)) {
this_cpu_dec();
preempt_enable();
return false;
}
return put_slowpath();
}

This doesn't solve the caller not inlining hot path but well I suppose
we can consider that the caller's problem. The above at least
wouldn't introduce an unnecessary branch on its own.
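
Concretely, something like this - where percpu_ref_put_slowpath() is just an
assumed out-of-line helper doing the atomic_dec_and_test(), not anything in
the posted series:

	static inline bool percpu_ref_put(struct percpu_ref *ref)
	{
		unsigned __percpu *pcpu_count;

		preempt_disable();
		pcpu_count = ACCESS_ONCE(ref->pcpu_count);
		if (likely(pcpu_count)) {
			/* percpu mode: can't be the final put, no 0 check needed */
			__this_cpu_dec(*pcpu_count);
			preempt_enable();
			return false;
		}
		preempt_enable();

		/* atomic mode (after percpu_ref_kill()): check for the final put */
		return percpu_ref_put_slowpath(ref);
	}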

Thanks.

--
tejun

2013-05-15 08:21:49

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 03:51:01PM +0200, Oleg Nesterov wrote:
> On 05/13, Kent Overstreet wrote:
> >
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > + unsigned __percpu *pcpu_count;
> > + unsigned __percpu *old;
> > + unsigned count = 0;
> > + int cpu;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + do {
> > + if (!pcpu_count)
> > + return 0;
> > +
> > + old = pcpu_count;
> > + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> > + } while (pcpu_count != old);
>
> This is purely cosmetic, feel free to ignore. But afaics all we
> need is
>
> pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> if (!cmpxchg(&ref->pcpu_count, pcpu_count, NULL))
> return 0;

Whoops, yep. I was ripping out the dynamic stuff from my dynamic percpu
refcount code and missed that bit.

2013-05-15 08:59:36

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 07:59:32AM -0700, Tejun Heo wrote:
> Hello,
>
> On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> > +/**
> > + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> > + *
> > + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
>
> Explanation on synchronization and use cases would be nice. People
> tend to develop massive mis-uses for interfaces like this.

hrm, kind of hard to know exactly what to say without seeing how people
misuse it first. How about this?

* Returns true the first time called on @ref and false if percpu_ref_kill() has
* already been called on @ref.
*
* The return value can optionally be used to synchronize shutdown, when
* multiple threads could try to destroy an object at the same time - if
* percpu_ref_kill() returns true, then this thread should release the initial
* refcount - see percpu_ref_put_initial_ref().



> > + */
> > +static inline int percpu_ref_dead(struct percpu_ref *ref)
> > +{
> > + return ref->pcpu_count == NULL;
> > +}
> ...
> > +/*
> > + * The trick to implementing percpu refcounts is shutdown. We can't detect the
> > + * ref hitting 0 on every put - this would require global synchronization and
> > + * defeat the whole purpose of using percpu refs.
> > + *
> > + * What we do is require the user to keep track of the initial refcount; we know
> > + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> > + * convert to non percpu mode before the initial ref is dropped everything
> > + * works.
>
> Can you please also explain why per-cpu wrapping is safe somewhere?

I feel like we had this exact discussion before and I came up with some
sort of explanation but I can't remember what I came up with. Here's
what I've got now...

* Initially, a percpu refcount is just a set of percpu counters. In that mode we
* don't try to detect the ref hitting 0 - which means that get/put can just
* increment or decrement the local counter. Note that the counter on a
* particular cpu can (and will) wrap - this is fine; when we go to shutdown, the
* percpu counters will all sum to the correct value (because modular arithmetic
* is commutative).
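
(A toy illustration of the wrapping, assuming the true refcount always fits
in 32 bits:

	unsigned a = 0, b = 0;	/* per-cpu deltas */

	a += 3;			/* three gets on one cpu */
	b -= 2;			/* two puts on another cpu: wraps to 0xfffffffe */

	/* summed at kill time: 3 + 0xfffffffe == 1 (mod 2^32) */
	BUG_ON(a + b != 1);

i.e. each cpu's counter is only meaningful mod 2^32, but their sum mod 2^32
is the true count.)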

> > + * Converting to non percpu mode is done with some RCUish stuff in
> > + * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
> > + * can't hit 0 before we've added up all the percpu refs.
> > + */
> > +
> > +#define PCPU_COUNT_BIAS (1ULL << 31)
>
> Are we sure this is enough? 1<<31 is a fairly large number but it's
> just easy enough to breach from time to time and it's gonna be hellish
> to reproduce / debug when it actually overflows. Maybe we want
> atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> guarantees that the bias can't over/underflow?

Well, it has the effect of halving the usable range of the refcount,
which I think is probably ok - the thing is, the range of an atomic_t
doesn't really correspond to anything useful on 64 bit machines so if
you're concerned about overflow you probably need to be using an
atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
too.

If we need a 64-ish bit refcount in the future (I don't think it matters
for AIO) I'll probably just make a percpu_ref64 - that uses u64s for the
percpu counters too.

Or... maybe just make this version use unsigned longs/atomic_long_ts
instead of 32 bit integers. I dunno, I'll think about it a bit.

I don't think it's urgent, it's easy to change the types if and when a
new user comes along for which it matters. IIRC the module code uses 32
bit ints for its refcounts and that's the next thing I was trying to
convert at one point.

> > +int percpu_ref_tryget(struct percpu_ref *ref)
> > +{
> > + int ret = 1;
> > +
> > + preempt_disable();
> > +
> > + if (!percpu_ref_dead(ref))
> > + percpu_ref_get(ref);
> > + else
> > + ret = 0;
> > +
> > + preempt_enable();
> > +
> > + return ret;
> > +}
>
> Why isn't the above one inline?

hmm, I suppose that's like two or three more instructions than normal
get, I'll make it inline.

> Why no /** comment on public functions? It'd be great if you can
> explicitly warn about the racy nature of the function - especially,
> the function may return overflowed or zero refcnt. BTW, why is this
> function necessary? What's the use case?

Module code - I should probably leave count() and tryget() out until the
module conversion is done. tryget() in particular is just trying to
match the existing module_tryget() and it could certainly be implemented
differently.

> > +/**
> > + * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
> > + *
> > + * Must be called before dropping the initial ref, so that percpu_ref_put()
> > + * knows to check for the refcount hitting 0. If the refcount was in percpu
> > + * mode, converts it back to single atomic counter mode.
> > + *
> > + * The caller must issue a synchronize_rcu()/call_rcu() before calling
> > + * percpu_ref_put() to drop the initial ref.
> > + *
> > + * Returns true the first time called on @ref and false if @ref is already
> > + * shutting down, so it may be used by the caller for synchronizing other parts
> > + * of a two stage shutdown.
> > + */
>
> I'm not sure I like this interface. Why does it allow being called
> multiple times? Why is that necessary? Wouldn't just making it
> return void and trigger WARN_ON() if it detects that it's being called
> multiple times better? Also, why not bool if the return value is
> true/false?

bool just feels a bit strange in the kernel because it's not used that
much, but yeah bool is correct here.

Whether it should return bool, or void like you said and WARN_ON() is
definitely debatable. I don't have a strong opinion on it - I did it
this way because it's commonly needed functionality and a convenient
place implement it, but if we see people misusing it in the future I
would definitely rip it out.

> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > + unsigned __percpu *pcpu_count;
> > + unsigned __percpu *old;
> > + unsigned count = 0;
> > + int cpu;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + do {
> > + if (!pcpu_count)
> > + return 0;
> > +
> > + old = pcpu_count;
> > + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> > + } while (pcpu_count != old);
> > +
> > + synchronize_sched();
>
> And this makes the whole function blocking. Why not use call_rcu() so
> that the ref can be called w/o sleepable context too?

Because you need to know when percpu_ref_kill() finishes so you know
when it's safe to drop the initial ref - if percpu_ref_kill() used
call_rcu() itself, it would then have to be doing the put itself...
which means we'd have to stick a pointer to the release function in
struct percpu_ref.

But this is definitely going to be an issue... I was thinking about
using the low bit of the pointer to indicate that the ref is dead so
that the caller could use call_rcu() and then call another function to
gather up the percpu counters, but that's pretty ugly.

I may just stick the release function in struct percpu_ref and have
percpu_ref_kill() use call_rcu() after all...

> > +/**
> > + * percpu_ref_put_initial_ref - safely drop the initial ref
> > + *
> > + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> > + * to put it back into single atomic_t mode with the appropriate barriers so
> > + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> > + *
> > + * Returns true if @ref hit 0.
> > + */
> > +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> > +{
> > + if (percpu_ref_kill(ref)) {
> > + return percpu_ref_put(ref);
> > + } else {
> > + WARN_ON(1);
> > + return 0;
> > + }
> > +}
>
> Can we just roll the above into percpu_ref_kill()? It's much harder
> to misuse if kill puts the base ref.

Possibly... if we did that we'd also be getting rid of percpu_ref_kill's
synchronization functionality.

I want to wait until after the call_rcu() thing is decided before
futzing with this part; there are some dependencies.

2013-05-15 09:01:32

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 05:28:36PM +0200, Oleg Nesterov wrote:
> On 05/14, Tejun Heo wrote:
> >
> > > +int percpu_ref_tryget(struct percpu_ref *ref)
> > > +{
> > > + int ret = 1;
> > > +
> > > + preempt_disable();
> > > +
> > > + if (!percpu_ref_dead(ref))
> > > + percpu_ref_get(ref);
> > > + else
> > > + ret = 0;
> > > +
> > > + preempt_enable();
> > > +
> > > + return ret;
> > > +}
> ...
> > BTW, why is this
> > function necessary? What's the use case?
>
> Yes, I was wondering too.
>
> And please note that this code _looks_ wrong, percpu_ref_get() still
> can increment ref->count.

Yeah I see what you mean, I changed how ret is set.

But also splitting tryget() and count() out into another patch to go
with the module conversion.

> Hmm. Just noticed this comment above percpu_ref_kill()
>
> * The caller must issue a synchronize_rcu()/call_rcu() before calling
> * percpu_ref_put() to drop the initial ref.
>
> Really?

That's also left over from the dynamic version, whoops.

2013-05-15 09:08:25

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 02:59:45PM -0700, Tejun Heo wrote:
> A couple more things.
>
> On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> ...
> > +/**
> > + * percpu_ref_put - decrement a dynamic percpu refcount
> > + *
> > + * Returns true if the result is 0, otherwise false; only checks for the ref
> > + * hitting 0 after percpu_ref_kill() has been called. Analagous to
> > + * atomic_dec_and_test().
> > + */
> > +static inline int percpu_ref_put(struct percpu_ref *ref)
>
> bool?

Was int to match atomic_dec_and_test(), but switching to bool.

>
> > +{
> > + unsigned __percpu *pcpu_count;
> > + int ret = 0;
> > +
> > + preempt_disable();
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (pcpu_count)
>
> We probably want likely() here.

Yeah, I suppose so.

>
> > + __this_cpu_dec(*pcpu_count);
> > + else
> > + ret = atomic_dec_and_test(&ref->count);
> > +
> > + preempt_enable();
> > +
> > + return ret;
>
> With likely() added, I think the compiler should be able to recognize
> that the branch on pcpu_count should exclude later branch in the
> caller to test for the final put in most cases but I'm a bit worried
> whether that would always be the case and wonder whether ->release
> based interface would be better. Another concern is that the above
> interface is likely to encourage its users to put the release
> implementation in the same function. e.g.

I... don't follow what you mean here at all - what exactly would the
compiler do differently? and how would passing a release function
matter?

> void my_put(my_obj)
> {
> if (!percpu_ref_put(&my_obj->ref))
> return;
> destroy my_obj;
> free my_obj;
> }
>
> Which in turn is likely to nudge the developer or compiler towards not
> inlining the fast path.

I'm kind of skeptical partial inlining would be worth it for just an
atomic_dec_and_test()...

> So, while I do like the simplicity of put() returning %true on the
> final put, I suspect it's more likely to slowing down fast paths due
> to its interface compared to having separate ->release function
> combined with void put(). Any ideas?

Oh, you mean having one branch instead of two when we're in percpu mode.
Yeah, that is a good point.

I bet with the likely() added the compiler is going to generate the same
code either way, but I suppose I can have a look at what gcc actually
does...

2013-05-15 09:26:22

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> On 05/13, Kent Overstreet wrote:
> >
> > +unsigned tag_alloc(struct tag_pool *pool, bool wait)
> > +{
> > +	struct tag_cpu_freelist *tags;
> > +	unsigned long flags;
> > +	unsigned ret;
> > +retry:
> > +	preempt_disable();
> > +	local_irq_save(flags);
> > +	tags = this_cpu_ptr(pool->tag_cpu);
> > +
> > +	while (!tags->nr_free) {
> > +		spin_lock(&pool->lock);
> > +
> > +		if (pool->nr_free)
> > +			move_tags(tags->free, &tags->nr_free,
> > +				  pool->free, &pool->nr_free,
> > +				  min(pool->nr_free, pool->watermark));
> > +		else if (wait) {
> > +			struct tag_waiter wait = { .task = current };
> > +
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			list_add(&wait.list, &pool->wait);
> > +
> > +			spin_unlock(&pool->lock);
> > +			local_irq_restore(flags);
> > +			preempt_enable();
> > +
> > +			schedule();
> > +			__set_current_state(TASK_RUNNING);
>
> schedule() always returns in TASK_RUNNING state
>
> > +
> > +			if (!list_empty_careful(&wait.list)) {
> > +				spin_lock_irqsave(&pool->lock, flags);
> > +				list_del_init(&wait.list);
> > +				spin_unlock_irqrestore(&pool->lock, flags);
>
> This is only theoretical, but racy.
>
> tag_free() does
>
> list_del_init(wait->list);
> /* WINDOW */
> wake_up_process(wait->task);
>
> in theory the caller of tag_alloc() can notice list_empty_careful(),
> return without taking pool->lock, exit, and free this task_struct.
>
> But the main problem is that it is not clear why this code reimplements
> add_wait_queue/wake_up_all, for what?

To save on locking... there's really no point in another lock for the
wait queue. Could just use the wait queue lock instead I suppose, like
wait_event_interruptible_locked()

(the extra spin_lock()/unlock() might not really cost anything but
nested irqsave()/restore() is ridiculously expensive, IME).

> I must admit, I do not understand what this code actually does ;)
> I didn't try to read it carefully though, but perhaps at least the
> changelog could explain more?

The changelog is admittedly terse, but that's basically all there is to
it -

Say you've got a device where you can have multiple outstanding
commands - you'll identify commands/responses by some integer (the
"tag"). Typically you won't get a full 64 bits for the tag, it might be
10 or 16 or 32 bits or whatever - and even if you could use raw pointers
you wouldn't really want to, because then if the device gives you a garbage
response you're dereferencing an untrusted pointer - you want to allocate tag
structures out of a fixed array so you can validate responses.

So you preallocate all your tag structures up front - now you can refer
to them by small fixed integers. But if you want to be able to
efficiently allocate from the same pool of tags across multiple CPUs -
well, that's what this code is for.
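
To make that concrete, here's a rough sketch of how a driver might sit on
top of the allocator (tag_alloc() and struct tag_pool are from the patch;
struct my_cmd, nr_tags and the helpers are made up for illustration):

	struct my_cmd {				/* preallocated per-tag state */
		unsigned	tag;
		struct request	*rq;
	};

	struct my_dev {
		struct tag_pool	tags;		/* hands out integers 0..nr_tags-1 */
		unsigned	nr_tags;
		struct my_cmd	*cmds;		/* array indexed by tag */
	};

	static struct my_cmd *my_dev_start_cmd(struct my_dev *dev)
	{
		unsigned tag = tag_alloc(&dev->tags, true);	/* may sleep */

		dev->cmds[tag].tag = tag;
		return &dev->cmds[tag];
	}

	static struct my_cmd *my_dev_handle_response(struct my_dev *dev, unsigned tag)
	{
		/* the tag came from the hardware - validate it, never trust it */
		if (tag >= dev->nr_tags)
			return NULL;

		return &dev->cmds[tag];
	}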

2013-05-15 09:34:45

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Tue, May 14, 2013 at 04:24:42PM +0200, Oleg Nesterov wrote:
> On 05/14, Oleg Nesterov wrote:
> >
> > I must admit, I do not understand what this code actually does ;)
> > I didn't try to read it carefully though, but perhaps at least the
> > changelog could explain more?
>
> OK, this is clear...
>
> But perhaps the changelog could explain who needs the "fast" version
> of, say, find_next_zero_bit + test_and_set_bit ;) Just curious.

Originally I wrote it for a driver (which still isn't open source) - but
find_next_zero_bit()/test_and_set_bit() is exactly what it was using
before and the performance gain was significant :)

The reason I'm posting it now is because AIO currently uses a linked
list for tracking outstanding kiocbs - for cancellation - and that
linked list needs to be replaced; I'm implementing cancellation for
regular direct IO and the linked list is a performance issue.

All we need for cancellation is a way to iterate over all the
(potentially) allocated kiocbs - it's really exactly the same problem as
managing tags in the drivers I was working on before (they also need to
be able to time out tags which is exactly the same as AIO cancellation).

What I found really annoying about the problem is that the existing slab
allocator tracks exactly what we need... but it's not exposed (and
honestly probably shouldn't be).

So, there were two choices:
* hack up slab/slob/slub - fuck no
* reuse my tag allocator, allocate kiocbs out of an array of pages.
Also allocate the pages lazily so we don't regress on memory
overhead.

So, that's what I did.
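
A rough sketch of that last idea - kiocbs carved out of an array of pages,
looked up by tag (all the names here are illustrative, not from the series;
the page array would be filled in lazily):

	#define KIOCBS_PER_PAGE		(PAGE_SIZE / sizeof(struct kiocb))

	static struct kiocb *tag_to_kiocb(struct page **pages, unsigned tag)
	{
		struct page *p = pages[tag / KIOCBS_PER_PAGE];

		return (struct kiocb *) page_address(p) + (tag % KIOCBS_PER_PAGE);
	}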

2013-05-15 15:44:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/15, Kent Overstreet wrote:
>
> On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> > tag_free() does
> >
> > list_del_init(wait->list);
> > /* WINDOW */
> > wake_up_process(wait->task);
> >
> > in theory the caller of tag_alloc() can notice list_empty_careful(),
> > return without taking pool->lock, exit, and free this task_struct.
> >
> > But the main problem is that it is not clear why this code reimplements
> > add_wait_queue/wake_up_all, for what?
>
> To save on locking... there's really no point in another lock for the
> wait queue. Could just use the wait queue lock instead I suppose, like
> wait_event_interruptible_locked()

Yes. Or perhaps you can reuse wait_queue_head_t->lock for move_tags().

And,

> (the extra spin_lock()/unlock() might not really cost anything but
> nested irqsave()/restore() is ridiculously expensive, IME).

But this is the slow path anyway. Even if you do not use _locked, how
much can this extra locking (save/restore) really make things worse?

In any case, I believe it would be much better to reuse the code we
already have, to avoid the races and make the code more understandable.
And to not bloat the code.

Do you really think that, say,

unsigned tag_alloc(struct tag_pool *pool, bool wait)
{
	struct tag_cpu_freelist *tags;
	unsigned ret = 0;
retry:
	tags = get_cpu_ptr(pool->tag_cpu);
	local_irq_disable();
	if (!tags->nr_free && pool->nr_free) {
		spin_lock(&pool->wq.lock);
		if (pool->nr_free)
			move_tags(...);
		spin_unlock(&pool->wq.lock);
	}

	if (tags->nr_free)
		ret = tags->free[--tags->nr_free];
	local_irq_enable();
	put_cpu_var(pool->tag_cpu);

	if (ret || !wait)
		return ret;

	__wait_event(&pool->wq, pool->nr_free);
	goto retry;
}

will be much slower?

> > I must admit, I do not understand what this code actually does ;)
> > I didn't try to read it carefully though, but perhaps at least the
> > changelog could explain more?
>
> The changelog is admittedly terse, but that's basically all there is to
> it -
> [...snip...]

Yes, thanks for your explanation, I already realized what it does...

Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
Perhaps it should also take waitqueue_active() into account?
tag_alloc() can sleep more than necessary, it seems.

Oleg.

2013-05-15 16:14:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

Damn, sorry for extra noise, I forgot to ask this twice...

And what about cpu_down()? Perhaps tag_pool should do move_tags()
on CPU_DEAD?

Or at least it should be documented that the dead cpu can lose up
to 2 * watermark entries.

Oleg.

2013-05-15 17:37:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hey, Kent.

On Wed, May 15, 2013 at 01:58:56AM -0700, Kent Overstreet wrote:
> > Explanation on synchronization and use cases would be nice. People
> > tend to develop massive mis-uses for interfaces like this.
>
> hrm, kind of hard to know exactly what to say without seeing how people
> misuse it first. How about this?
>
> * Returns true the first time it's called on @ref, and false if percpu_ref_kill()
> * has already been called on @ref.
> *
> * The return value can optionally be used to synchronize shutdown, when
> * multiple threads could try to destroy an object at the same time - if
> * percpu_ref_kill() returns true, then this thread should release the initial
> * refcount - see percpu_ref_put_initial_ref().

Ooh, I was referring to percpu_ref_dead() not percpu_ref_kill().
percpu_ref_dead() reminds me of some of the work state query functions
in workqueue which ended up being misused in ways that were subtly
racy, so I'm curious why it's necessary and how it's supposed to be
used.

> > > + * What we do is require the user to keep track of the initial refcount; we know
> > > + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> > > + * convert to non percpu mode before the initial ref is dropped everything
> > > + * works.
> >
> > Can you please also explain why per-cpu wrapping is safe somewhere?
>
> I feel like we had this exact discussion before and I came up with some

Yeap, we did.

> sort of explanation but I can't remember what I came up with. Here's
> what I've got now...
>
> * Initially, a percpu refcount is just a set of percpu counters. Initially, we
> * don't try to detect the ref hitting 0 - which means that get/put can just
> * increment or decrement the local counter. Note that the counter on a
> * particular cpu can (and will) wrap - this is fine, when we go to shutdown the
> * percpu counters will all sum to the correct value (because modular arithmetic
> * is commutative).

Can you please expand it on a bit and, more importantly, describe in
what limits, it's safe? This should be safe as long as the actual sum
of refcnts given out doesn't overflow the original type, right? It'd
be great if that is explained clearly in more intuitive way. The only
actual explanation above is "modular arithmetic is commutative" which
is a very compact way to put it and I really think it deserves an
easier explanation.

> > Are we sure this is enough? 1<<31 is a fairly large number but it's
> > just easy enough to breach from time to time and it's gonna be hellish
> > to reproduce / debug when it actually overflows. Maybe we want
> > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> > guarantees that the bias can't over/underflow?
>
> Well, it has the effect of halving the usable range of the refcount,
> which I think is probably ok - the thing is, the range of an atomic_t
> doesn't really correspond to anything useful on 64 bit machines so if
> you're concerned about overflow you probably need to be using an
> atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
> too.

I'm not worrying about the total refcnt overflowing 31 bits, that's
fine. What I'm worried about is the percpu refs having systematic
drift (got on certain cpus and put on others), and the total counter
being overflowed while percpu draining is in progress. To me, the
problem is that the bias which tags that draining in progress can be
overflown by percpu refs. The summing can be the same but the tagging
should be put where summing can't overflow it. It'd be great if you
can explain in the comment in what range it's safe and why, because
that'd make the limits clear to both you and other people reading the
code and would help a lot in deciding whether it's safe enough.

> > Why no /** comment on public functions? It'd be great if you can
> > explicitly warn about the racy nature of the function - especially,
> > the function may return overflowed or zero refcnt. BTW, why is this
> > function necessary? What's the use case?
>
> Module code - I should probably leave count() and tryget() out until the
> module conversion is done. tryget() in particular is just trying to
> match the existing module_tryget() and it could certainly be implemented
> differently.

I probably should have made it clearer. Sorry about that. tryget()
is fine. I was curious about count() as it's always a bit dangerous a
query interface which is racy and can return something unexpected like
false zero or underflowed refcnt.

> > I'm not sure I like this interface. Why does it allow being called
> > multiple times? Why is that necessary? Wouldn't just making it
> > return void and trigger WARN_ON() if it detects that it's being called
> > multiple times better? Also, why not bool if the return value is
> > true/false?
>
> bool just feels a bit strange in the kernel because it's not used that
> much, but yeah bool is correct here.

Well, it's added later on and we're still in the process of converting
to bool. New things are supposed to use it and they do most of the
time, so let's please stick to it.

> Whether it should return bool, or void like you said and WARN_ON() is
> definitely debatable. I don't have a strong opinion on it - I did it
> this way because it's commonly needed functionality and a convenient
> place implement it, but if we see people misusing it in the future I
> would definitely rip it out.

It's superfluous and kinda reminds me of get(ptr) returning ptr, which
people thought would be neat as it allows chaining calls on top of it.
It hides from the compiler the fact that the ptr can't change, and much more
importantly in not so few cases it led people to check the return
value for NULL believing it somehow would take care of the last put
synchronization. I think we still have such bugs lurking around
kobject.

And I think this one also provides ample opportunities for misuses.
It's an interface which kills a refcnt but doesn't require a reference
as it doesn't put one and people would be tempted to write the
following in racy paths.

	if (ref_kill(ref))
		ref_put(ref);

which seems innocent enough, except that it almost invites
use-after-free on the ref_kill() call. Why is it calling a function
which kills the ref if it doesn't hold a ref?

Again, let's *please* stick to the known patterns unless deviation is
explicitly justified. There are very good reasons why people want
justifications when something deviates from the established
conventions / what's necessary for a given interface. It's very easy
to introduce something which is broken in subtle yet fundamental ways
and it's just a fact that the author or reviewers are gonna miss some
eventually.

I think I've repeated this multiple times over the past year but here
it is again - deviation or complexity require justification. We
absolutely shouldn't be doing something unnecessarily unusual and then
try to see what happens. The risk is higher than immediately visible
and totally unnecessary.

> > > +
> > > + synchronize_sched();
> >
> > And this makes the whole function blocking. Why not use call_rcu() so
> > that the ref can be called w/o sleepable context too?
>
> Because you need to know when percpu_ref_kill() finishes so you know
> when it's safe to drop the initial ref - if percpu_ref_kill() used
> call_rcu() itself, it would then have to be doing the put itself...
> which means we'd have to stick a pointer to the release function in
> struct percpu_ref.

Hmmm... okay.

> But this is definitely going to be an issue... I was thinking about
> using the low bit of the pointer to indicate that the ref is dead so
> that the caller could use call_rcu() and then call another function to
> gather up the percpu counters, but that's pretty ugly.
>
> I may just stick the release function in struct percpu_ref and have
> percpu_ref_kill() use call_rcu() after all...

That seems like a better option. It definitely is a hell of a lot more
intuitive.

> > Can we just roll the above into percpu_ref_kill()? It's much harder
> > to misuse if kill puts the base ref.
>
> Possibly... if we did that we'd also be getting rid of percpu_ref_kill's
> synchronization functionality.
>
> I want to wait until after the call_rcu() thing is decided before
> futzing with this part, there are some dependencies.

Let's just have percpu_ref_kill(ref, release) which puts the base ref
and invokes release whenever it's done.

Thanks.

--
tejun

2013-05-15 17:52:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Mon, May 13 2013, Kent Overstreet wrote:
> If a bio is associated with a kiocb, allow it to be cancelled.
>
> This is accomplished by adding a pointer to a kiocb in struct bio, and
> when we go to dequeue a request we check if its bio has been cancelled -
> if so, we end the request with -ECANCELED.
>
> We don't currently try to cancel bios if IO has already been started -
> that'd require a per bio callback function, and a way to find all the
> outstanding bios for a given kiocb. Such a mechanism may or may not be
> added in the future but this patch tries to start simple.
>
> Currently this can only be triggered with aio and io_cancel(), but the
> mechanism can be used for sync io too.
>
> It can also be used for bios created by stacking drivers, and bio clones
> in general - when cloning a bio, if the bi_iocb pointer is copied as
> well the clone will then be cancellable. bio_clone() could be modified
> to do this, but hasn't in this patch because all the bio_clone() users
> would need to be audited to make sure that it's safe. We can't blindly
> make e.g. raid5 writes cancellable without the knowledge of the md code.

This is a pretty ugly hack, to be honest. It only works for aio. And it
grows struct bio just for that.

I do like the staged approach, where we just check whether a bio is
canceled when we come across it in the various parts of bio allocate to
completion.

> @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> 			trace_block_rq_issue(q, rq);
> 		}
>
> +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> +			blk_start_request(rq);
> +			__blk_end_request_all(rq, -ECANCELED);
> +			continue;
> +		}

Pretty hacky too, given that it only works for the generic case of a
non-merged bio.

So nack on this one.

--
Jens Axboe

2013-05-15 17:56:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hey,

On Wed, May 15, 2013 at 02:07:42AM -0700, Kent Overstreet wrote:
> > > +		__this_cpu_dec(*pcpu_count);
> > > +	else
> > > +		ret = atomic_dec_and_test(&ref->count);
> > > +
> > > +	preempt_enable();
> > > +
> > > +	return ret;
> >
> > With likely() added, I think the compiler should be able to recognize
> > that the branch on pcpu_count should exclude later branch in the
> > caller to test for the final put in most cases but I'm a bit worried
> > whether that would always be the case and wonder whether ->release
> > based interface would be better. Another concern is that the above
> > interface is likely to encourage its users to put the release
> > implementation in the same function. e.g.
>
> I... don't follow what you mean here at all - what exactly would the
> compiler do differently? And how would passing a release function
> matter?

So, on the fast path, there should be one branch on the percpu
pointer; however, given the above code, especially without likely(),
the compiler may well choose to emit two branches which are shared by
both hot and cold paths - the first one on the percpu pointer, the
second on whether ref->count reached zero. It just isn't clear to the
compiler whether duplicated preempt_enable() or an extra branch would
be cheaper.

> > void my_put(my_obj)
> > {
> > 	if (!percpu_ref_put(&my_obj->ref))
> > 		return;
> > 	destroy my_obj;
> > 	free my_obj;
> > }
> >
> > Which in turn is likely to nudge the developer or compiler towards not
> > inlining the fast path.
>
> I'm kind of skeptical partial inlining would be worth it for just an
> atomic_dec_and_test()...

Ooh, you can do the slow path inline too but I *suspect* we probably
need a bit more logic in the slowpath anyway if we wanna take care of
the bias overflow and maybe the release callback, and it really
doesn't matter a bit whether you have a call for slowpath, so...

> > So, while I do like the simplicity of put() returning %true on the
> > final put, I suspect it's more likely to slowing down fast paths due
> > to its interface compared to having separate ->release function
> > combined with void put(). Any ideas?
>
> Oh, you mean having one branch instead of two when we're in percpu mode.
> Yeah, that is a good point.

Yeap, heh, I should have read to the end before replying. :)

> I bet with the likely() added the compiler is going to generate the same
> code either way, but I suppose I can have a look at what gcc actually
> does...

Yeah, with likely(), I *think* gcc should get it right most of the
time. There might be some edge cases tho.
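
For comparison, a minimal sketch of the ->release flavour being discussed -
ref->release is an assumption here, the posted patch instead has put()
return whether the ref hit 0:

	static inline void percpu_ref_put(struct percpu_ref *ref)
	{
		unsigned __percpu *pcpu_count;

		preempt_disable();

		pcpu_count = ACCESS_ONCE(ref->pcpu_count);

		if (likely(pcpu_count))
			__this_cpu_dec(*pcpu_count);	/* fast path: one branch */
		else if (atomic_dec_and_test(&ref->count))
			ref->release(ref);	/* slow path owns the final put;
						 * note it runs with preemption off */

		preempt_enable();
	}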

Thanks.

--
tejun

2013-05-15 19:29:44

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15, 2013 at 07:52:43PM +0200, Jens Axboe wrote:
> On Mon, May 13 2013, Kent Overstreet wrote:
> > If a bio is associated with a kiocb, allow it to be cancelled.
> >
> > This is accomplished by adding a pointer to a kiocb in struct bio, and
> > when we go to dequeue a request we check if its bio has been cancelled -
> > if so, we end the request with -ECANCELED.
> >
> > We don't currently try to cancel bios if IO has already been started -
> > that'd require a per bio callback function, and a way to find all the
> > outstanding bios for a given kiocb. Such a mechanism may or may not be
> > added in the future but this patch tries to start simple.
> >
> > Currently this can only be triggered with aio and io_cancel(), but the
> > mechanism can be used for sync io too.
> >
> > It can also be used for bios created by stacking drivers, and bio clones
> > in general - when cloning a bio, if the bi_iocb pointer is copied as
> > well the clone will then be cancellable. bio_clone() could be modified
> > to do this, but hasn't in this patch because all the bio_clone() users
> > would need to be audited to make sure that it's safe. We can't blindly
> > make e.g. raid5 writes cancellable without the knowledge of the md code.
>
> This is a pretty ugly hack, to be honest. It only works for aio. And it
> grows struct bio just for that.

It's only implemented for aio in this patch but it's actually completely
trivial to extend to sync kiocbs too - we can make killing a process
cancel outstanding sync DIOs, I just haven't gotten around to writing
the code. With sync kiocbs anything can use it.

I do hate to grow struct bio, but the aio attribute stuff I'm also
working on is going to need the same damn thing.

> I do like the staged approach, where we just check whether a bio is
> canceled when we come across it in the various parts of bio allocate to
> completion.

Yeah, that's the only sane way to do it imo. If we had to do it with the
ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
the outstanding bios on a list protected by a lock so we could chase
down all the bios we need to cancel, and I don't even want to think
about stacking devices...

This is also trivial to plumb through stacking devices, for ones that
want to support it - md for example probably wouldn't want to support
cancellation for writes (raid consistency) but for reads all it has to
do is copy the kiocb pointer to the new bios it creates.

(I keep having people tell me we're (Google) going to need cancel for
outstanding NCQ/TCQ requests, to which my response has been "LALALALA GO
AWAY I CAN'T HEAR YOU").

> > @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> > 			trace_block_rq_issue(q, rq);
> > 		}
> >
> > +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> > +			blk_start_request(rq);
> > +			__blk_end_request_all(rq, -ECANCELED);
> > +			continue;
> > +		}
>
> Pretty hacky too, given that it only works for the generic case of a
> non-merged bio.

More incomplete than hacky, imo - since with spinning disks you wouldn't
save much by cancelling one bio out of a merged request. It would make
sense to cancel the request if all the bios have been cancelled, but
wanted to start out simple and get something useful with a minimal
amount of code.

Anyways, this patch is still more at the RFC stage but there is serious
demand for cancellation (I've seen what people are using it for, it's
not all crazy and the lack of it is something people are working around
today, painfully).
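
As a hedged sketch of the stacking-driver case described above - a read-side
clone path copying the kiocb pointer so the clone stays cancellable (bi_iocb
is the field this series adds; the function name and the reads-only policy
are made up):

	static struct bio *clone_cancellable_read(struct bio *orig, struct bio_set *bs)
	{
		struct bio *clone = bio_clone_bioset(orig, GFP_NOIO, bs);

		/* reads only: cancelling a cloned write could break e.g. raid
		 * consistency, so a driver would opt in per bio type */
		if (clone && bio_data_dir(orig) == READ)
			clone->bi_iocb = orig->bi_iocb;

		return clone;
	}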

2013-05-15 20:01:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15 2013, Kent Overstreet wrote:
> On Wed, May 15, 2013 at 07:52:43PM +0200, Jens Axboe wrote:
> > On Mon, May 13 2013, Kent Overstreet wrote:
> > > If a bio is associated with a kiocb, allow it to be cancelled.
> > >
> > > This is accomplished by adding a pointer to a kiocb in struct bio, and
> > > when we go to dequeue a request we check if its bio has been cancelled -
> > > if so, we end the request with -ECANCELED.
> > >
> > > We don't currently try to cancel bios if IO has already been started -
> > > that'd require a per bio callback function, and a way to find all the
> > > outstanding bios for a given kiocb. Such a mechanism may or may not be
> > > added in the future but this patch tries to start simple.
> > >
> > > Currently this can only be triggered with aio and io_cancel(), but the
> > > mechanism can be used for sync io too.
> > >
> > > It can also be used for bios created by stacking drivers, and bio clones
> > > in general - when cloning a bio, if the bi_iocb pointer is copied as
> > > well the clone will then be cancellable. bio_clone() could be modified
> > > to do this, but hasn't in this patch because all the bio_clone() users
> > > would need to be audited to make sure that it's safe. We can't blindly
> > > make e.g. raid5 writes cancellable without the knowledge of the md code.
> >
> > This is a pretty ugly hack, to be honest. It only works for aio. And it
> > grows struct bio just for that.
>
> It's only implemented for aio in this patch but it's actually completely
> trivial to extend to sync kiocbs too - we can make killing a process
> cancel outstanding sync DIOs, I just haven't gotten around to writing
> the code. With sync kiocbs anything can use it.

Oh, that wasn't even my point. It only works for iocb "backed" bios was
my point. You would ideally like cancel for other areas as well. One
that comes to mind is truncating files, for instance.

> I do hate to grow struct bio, but the aio attribute stuff I'm also
> working on is going to need the same damn thing.

If you (you being aio here) want to support cancel, then why not just
stuff it into bi_private?

> > I do like the staged approach, where we just check whether a bio is
> > canceled when we come across it in the various parts of bio allocate to
> > completion.
>
> Yeah, that's the only sane way to do it imo. If we had to do it with the
> ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
> the outstanding bios on a list protected by a lock so we could chase
> down all the bios we need to cancel, and I don't even want to think
> about stacking devices...

Perfection is the enemy of good. Doing tracking across the full stack is
just going to be insane, just don't do it...

> This is also trivial to plumb through stacking devices, for ones that
> want to support it - md for example probably wouldn't want to support
> cancellation for writes (raid consistency) but for reads all it has to
> do is copy the kiocb pointer to the new bios it creates.
>
> (I keep having people tell me we're (Google) going to need cancel for
> outstanding NCQ/TCQ requests, to which my response has been "LALALALA GO
> AWAY I CAN'T HEAR YOU").

I'm still in that camp, to be honest, even for the generic cases. And
for IO that has gone to the hardware, well, that's really into la-la
land. That is just never going to be something that is supportable,
except perhaps for very confined and controlled setups (like Googles, I
would imagine :-).

> > > @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> > > 			trace_block_rq_issue(q, rq);
> > > 		}
> > >
> > > +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> > > +			blk_start_request(rq);
> > > +			__blk_end_request_all(rq, -ECANCELED);
> > > +			continue;
> > > +		}
> >
> > Pretty hacky too, given that it only works for the generic case of a
> > non-merged bio.
>
> More incomplete than hacky, imo - since with spinning disks you wouldn't
> save much by cancelling one bio out of a merged request. It would make
> sense to cancel the request if all the bios have been cancelled, but
> wanted to start out simple and get something useful with a minimal
> amount of code.
>
> Anyways, this patch is still more at the RFC stage but there is serious
> demand for cancellation (I've seen what people are using it for, it's
> not all crazy and the lack of it is something people are working around
> today, painfully).

I'd be willing to entertain the idea, if the implementation is low
enough overhead and makes sense. So not completely nacking the idea, I'd
just prefer to see something a bit more baked.

--
Jens Axboe

2013-05-15 20:19:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

Kent Overstreet <[email protected]> writes:

> Allocates integers out of a predefined range - for use by e.g. a driver
> to allocate tags for communicating with the device.

Can this really not be merged with idr.c ?

Would an idr per cpu do?

-Andi

--
[email protected] -- Speaking for myself only

2013-05-16 01:07:25

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Kent Overstreet <[email protected]> writes:
> This implements a refcount with similar semantics to
> atomic_get()/atomic_dec_and_test() - but percpu.

Ah! This is why I was CC'd... Now I understand. Thanks :)

Delighted to see someone chasing this. I had an implementation of such
a thing last decade, but the slowmode pattern didn't make for trivial
kref conversions, so I dropped it.

Note: I haven't read the other feedback yet, so ignore if dups.

> +int percpu_ref_init(struct percpu_ref *ref);

Why not just run in slow mode when allocation fails? Things which can't
fail make for simpler use.

> +int percpu_ref_tryget(struct percpu_ref *ref);
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref);

This is part of a slightly different pattern: the owned refcount.

In fact, I think that's the most sane pattern to use (but I could be
wrong; does the AIO stuff fit?). If so, promote this to the first class
citizen, and if necessary expose kill as __percpu_ref_kill()?

(I might suggest percpu_ref_owner_put() as a name, in fact).

> +/**
> + * percpu_ref_get - increment a dynamic percpu refcount
> + *
> + * Analogous to atomic_inc().
> + */
> +static inline void percpu_ref_get(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +
> +	preempt_disable();
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	if (pcpu_count)
> +		__this_cpu_inc(*pcpu_count);
> +	else
> +		atomic_inc(&ref->count);
> +
> +	preempt_enable();
> +}

s/preempt_disable()/rcu_read_lock()/ ?

> +/**
> + * percpu_ref_put - decrement a dynamic percpu refcount
> + *
> + * Returns true if the result is 0, otherwise false; only checks for the ref
> + * hitting 0 after percpu_ref_kill() has been called. Analogous to
> + * atomic_dec_and_test().
> + */
> +static inline int percpu_ref_put(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +	int ret = 0;
> +
> +	preempt_disable();
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	if (pcpu_count)
> +		__this_cpu_dec(*pcpu_count);
> +	else
> +		ret = atomic_dec_and_test(&ref->count);
> +
> +	preempt_enable();
> +
> +	return ret;
> +}

Here too. And if you don't put unlikely() in this code, you lose kernel
hacker points :)

And int/true/false is for old-timers.

> +
> +unsigned percpu_ref_count(struct percpu_ref *ref);
> +int percpu_ref_kill(struct percpu_ref *ref);
> +
> +/**
> + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> + *
> + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
> + */
> +static inline int percpu_ref_dead(struct percpu_ref *ref)
> +{
> +	return ref->pcpu_count == NULL;
> +}

Can you unexpose these? I think percpu_ref_init(), ...get(), ...put()
and ...put_initial() are a nicer API.

> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +	unsigned __percpu *old;
> +	unsigned count = 0;
> +	int cpu;
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	do {
> +		if (!pcpu_count)
> +			return 0;
> +
> +		old = pcpu_count;
> +		pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> +	} while (pcpu_count != old);

This is more complex than it needs to be, no?


	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	if (!pcpu_count)
		return 0;
	if (cmpxchg(&ref->pcpu_count, pcpu_count, NULL) == NULL)
		return 0;

Of course, if all callers use the owner pattern, this is simply:

	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	BUG_ON(!pcpu_count);

> +	synchronize_sched();

synchronize_rcu() ?

> +	for_each_possible_cpu(cpu)
> +		count += *per_cpu_ptr(pcpu_count, cpu);
> +
> +	free_percpu(pcpu_count);
> +
> +	pr_debug("global %lli pcpu %i",
> +		 (int64_t) atomic_read(&ref->count), (int) count);
> +
> +	atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
> +
> +	return 1;
> +}
> +
> +/**
> + * percpu_ref_put_initial_ref - safely drop the initial ref
> + *
> + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> + * to put it back into single atomic_t mode with the appropriate barriers so
> + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> + *
> + * Returns true if @ref hit 0.
> + */
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> +{
> +	if (percpu_ref_kill(ref)) {
> +		return percpu_ref_put(ref);
> +	} else {
> +		WARN_ON(1);
> +		return 0;
> +	}
> +}

Note that percpu_ref_restore_initial_ref() is also possible, and may be
useful for the module code... (or percpu_ref_owner_get).

Great stuff!
Rusty.

2013-05-28 23:47:34

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
> Ooh, I was referring to percpu_ref_dead() not percpu_ref_kill().
> percpu_ref_dead() reminds me of some of the work state query functions
> in workqueue which ended up being misused in ways that were subtly
> racy, so I'm curious why it's necessary and how it's supposed to be
> used.

With the other changes we talked about I ended up killing
percpu_ref_dead()

> > * Initially, a percpu refcount is just a set of percpu counters. Initially, we
> > * don't try to detect the ref hitting 0 - which means that get/put can just
> > * increment or decrement the local counter. Note that the counter on a
> > * particular cpu can (and will) wrap - this is fine, when we go to shutdown the
> > * percpu counters will all sum to the correct value (because modular arithmetic
> > * is commutative).
>
> Can you please expand it on a bit and, more importantly, describe in
> what limits, it's safe? This should be safe as long as the actual sum
> of refcnts given out doesn't overflow the original type, right?

Precisely.

> It'd be great if that is explained clearly in more intuitive way. The
> only actual explanation above is "modular arithmetic is commutative"
> which is a very compact way to put it and I really think it deserves
> an easier explanation.

I'm not sure I know of any good way of explaining it intuitively, but
here's this at least...

* (More precisely: because modular arithmetic is commutative the sum of all the
* pcpu_count vars will be equal to what it would have been if all the gets and
* puts were done to a single integer, even if some of the percpu integers
* overflow or underflow).

> > > Are we sure this is enough? 1<<31 is a fairly large number but it's
> > > just easy enough to breach from time to time and it's gonna be hellish
> > > to reproduce / debug when it actually overflows. Maybe we want
> > > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> > > guarantees that the bias can't over/underflow?
> >
> > Well, it has the effect of halving the usable range of the refcount,
> > which I think is probably ok - the thing is, the range of an atomic_t
> > doesn't really correspond to anything useful on 64 bit machines so if
> > you're concerned about overflow you probably need to be using an
> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
> > too.
>
> I'm not worrying about the total refcnt overflowing 31 bits, that's
> fine. What I'm worried about is the percpu refs having systematic
> drift (got on certain cpus and put on others), and the total counter
> being overflowed while percpu draining is in progress. To me, the
> problem is that the bias which tags that draining in progress can be
> overflown by percpu refs. The summing can be the same but the tagging
> should be put where summing can't overflow it. It'd be great if you
> can explain in the comment in what range it's safe and why, because
> that'd make the limits clear to both you and other people reading the
> code and would help a lot in deciding whether it's safe enough.

(This is why I initially didn't (don't) like the bias method, it makes
things harder to reason about).

The fact that the counter is percpu is irrelevant w.r.t. the bias; we
sum all the percpu counters up before adding them to the atomic counter
and subtracting the bias, so when we go to add the percpu counters it's
no different from if the percpu counter was a single integer all along.

So there's only two counters we're adding together; there's the percpu
counter (just think of it as a single integer) that we started out
using, but then at some point in time we start applying the gets and
puts to the atomic counter.

Note that there's no systemic drift here; at time t all the gets and
puts were happening to one counter, and then at time t+1 they switch to
a different counter.

We know the sum of the counters will be positive (again, because modular
arithmetic is still commutative; when we sum the counters it's as if
there was a single counter all along) but that doesn't mean either of
the individual counters can't be negative.

(Actually, unless I'm mistaken, in this version the percpu counter can
never go negative - it definitely could with dynamic percpu allocation,
as you need an atomic_t -> percpu transition when the atomic_t was > 0
for the percpu counter to go negative; but in this version we start out
using the percpu counters with the atomic_t at 0, ignoring for the moment
the bias and the initial ref.)

So, the sum must be positive but the atomic_t could be negative. How
negative?

We can't do a get() to the percpu counters after we've seen that the ref
is no longer in percpu mode - so after we've done one put to the
atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
we can't do a get to the percpu counter.

And we can't do more puts than there have been gets - because the sum
can't be negative. So the most puts() we can do at any given time is the
real count, or sum of the percpu ref and atomic_t.

Therefore, the amount the atomic_t can go negative is bounded by the
maximum value of the refcount.

So if we say arbitrarily that the maximum legal value of the refcount is
- say - 1U << 31, then the atomic_t will always be greater than
-((int) (1U << 31)).

So as long as the total number of outstanding refs never exceeds the
bias we're fine.

QED.
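
As a worked example of the argument (plain userspace C just to make the
wraparound and the negative atomic_t visible; the numbers are arbitrary and
the bias is left out):

	#include <stdio.h>

	int main(void)
	{
		unsigned cpu0 = 0, cpu1 = 0;	/* two percpu counters */
		int	 atomic = 0;		/* stands in for ref->count */

		cpu0 += 3;	/* three gets happen on cpu0 */
		cpu1 -= 2;	/* two of those refs are put on cpu1; the counter
				 * wraps to UINT_MAX - 1, which is fine */

		/* percpu_ref_kill(): the percpu side is frozen from here on */
		atomic -= 1;	/* the last outstanding ref is dropped - atomic goes
				 * negative, but only by the number of live refs */

		/* kill-time folding: the unsigned sum cpu0 + cpu1 is exactly 1 */
		atomic += (int) (cpu0 + cpu1);

		printf("final count = %d\n", atomic);	/* prints 0 */
		return 0;
	}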

> I probably should have made it clearer. Sorry about that. tryget()
> is fine. I was curious about count() as it's always a bit dangerous a
> query interface which is racy and can return something unexpected like
> false zero or underflowed refcnt.

Yeah, it is, it was intended just for the module code where it's only
used for the value lsmod shows.

> Let's just have percpu_ref_kill(ref, release) which puts the base ref
> and invokes release whenever it's done.

Release has to be stored in struct percpu_ref() so it can be invoked
after a call_rcu() (percpu_ref_kill -> call_rcu() ->
percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
percpu_ref_init(), but yeah.
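
For reference, a rough sketch of that call chain, using the low-bit-of-the-
pointer "dead" flag mentioned earlier in the thread. The field names and
layout are assumptions about where this is heading, not the posted code, and
it assumes get()/put() check the dead bit under rcu_read_lock():

	#define PCPU_REF_DEAD		1UL

	struct percpu_ref {
		atomic_t	count;
		unsigned long	pcpu_count;	/* percpu pointer; low bit = dead */
		void		(*release)(struct percpu_ref *);
		struct rcu_head	rcu;
	};

	static void percpu_ref_kill_rcu(struct rcu_head *rcu)
	{
		struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
		unsigned __percpu *pcpu_count =
			(unsigned __percpu *) (ref->pcpu_count & ~PCPU_REF_DEAD);
		unsigned count = 0;
		int cpu;

		/* a grace period has passed: nobody still sees the ref as percpu */
		for_each_possible_cpu(cpu)
			count += *per_cpu_ptr(pcpu_count, cpu);
		free_percpu(pcpu_count);

		/* fold the percpu sum in and remove the bias */
		atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);

		if (atomic_dec_and_test(&ref->count))	/* drop the initial ref */
			ref->release(ref);
	}

	void percpu_ref_kill(struct percpu_ref *ref)
	{
		/* the posted code does this with cmpxchg(); plain or-assign
		 * keeps the sketch short */
		ref->pcpu_count |= PCPU_REF_DEAD;

		call_rcu(&ref->rcu, percpu_ref_kill_rcu);
	}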

2013-05-29 01:11:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Yo,

On Tue, May 28, 2013 at 04:47:28PM -0700, Kent Overstreet wrote:
> > It'd be great if that is explained clearly in more intuitive way. The
> > only actual explanation above is "modular arithmetic is commutative"
> > which is a very compact way to put it and I really think it deserves
> > an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
> * (More precisely: because modular arithmetic is commutative the sum of all the
> * pcpu_count vars will be equal to what it would have been if all the gets and
> * puts were done to a single integer, even if some of the percpu integers
> * overflow or underflow).

Yeah, that's much better.

> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.

Ah, okay, I thought you were collecting the percpu counters directly
into the global counter. You're staging it into a temp counter and
then adding it into the global counter after the summing is complete.
Yeap, that should be fine then. It'd be worthwhile to document the
importance of not adding it directly to the global counter.

> > I probably should have made it clearer. Sorry about that. tryget()
> > is fine. I was curious about count() as it's always a bit dangerous a
> > query interface which is racy and can return something unexpected like
> > false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Let's document so then and limit the range returned. We require the
refcnt to be alive and it'd be a good way to both protect from and
deter creative usages.

> > Let's just have percpu_ref_kill(ref, release) which puts the base ref
> > and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Yeah, I'm a bit torn about where to put the release function. For me,
as we have an API which is dedicated to killing a refcnt, it does make
sense to put it there but it's really in the realm of bikeshedding so
choose whatever you wanna choose.

Thanks!

--
tejun

2013-05-29 05:07:49

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Kent Overstreet <[email protected]> writes:
> On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
>> Can you please expand it on a bit and, more importantly, describe in
>> what limits, it's safe? This should be safe as long as the actual sum
>> of refcnts given out doesn't overflow the original type, right?
>
> Precisely.
>
>> It'd be great if that is explained clearly in more intuitive way. The
>> only actual explanation above is "modular arithmetic is commutative"
>> which is a very compact way to put it and I really think it deserves
>> an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
> * (More precisely: because modular arithmetic is commutative the sum of all the
> * pcpu_count vars will be equal to what it would have been if all the gets and
> * puts were done to a single integer, even if some of the percpu integers
> * overflow or underflow).

This seems intuitively obvious, so I wouldn't sweat it too much. What
goes up, has to come down somewhere.

>> > > Are we sure this is enough? 1<<31 is a fairly large number but it's
>> > > just easy enough to breach from time to time and it's gonna be hellish
>> > > to reproduce / debug when it actually overflows. Maybe we want
>> > > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
>> > > guarantees that the bias can't over/underflow?
>> >
>> > Well, it has the effect of halving the usable range of the refcount,
>> > which I think is probably ok - the thing is, the range of an atomic_t
>> > doesn't really correspond to anything useful on 64 bit machines so if
>> > you're concerned about overflow you probably need to be using an
>> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
>> > too.
>>
>> I'm not worrying about the total refcnt overflowing 31 bits, that's
>> fine. What I'm worried about is the percpu refs having systematic
>> drift (got on certain cpus and put on others), and the total counter
>> being overflowed while percpu draining is in progress. To me, the
>> problem is that the bias which tags that draining in progress can be
>> overflown by percpu refs. The summing can be the same but the tagging
>> should be put where summing can't overflow it. It'd be great if you
>> can explain in the comment in what range it's safe and why, because
>> that'd make the limits clear to both you and other people reading the
>> code and would help a lot in deciding whether it's safe enough.
>
> (This is why I initially didn't (don't) like the bias method, it makes
> things harder to reason about).
>
> The fact that the counter is percpu is irrelevant w.r.t. the bias; we
> sum all the percpu counters up before adding them to the atomic counter
> and subtracting the bias, so when we go to add the percpu counters it's
> no different from if the percpu counter was a single integer all along.
>
> So there's only two counters we're adding together; there's the percpu
> counter (just think of it as a single integer) that we started out
> using, but then at some point in time we start applying the gets and
> puts to the atomic counter.
>
> Note that there's no systemic drift here; at time t all the gets and
> puts were happening to one counter, and then at time t+1 they switch to
> a different counter.
>
> We know the sum of the counters will be positive (again, because modular
> arithmetic is still commutative; when we sum the counters it's as if
> there was a single counter all along) but that doesn't mean either of
> the individual counters can't be negative.
>
> (Actually, unless I'm mistaken, in this version the percpu counter can
> never go negative - it definitely could with dynamic percpu allocation,
> as you need an atomic_t -> percpu transition when the atomic_t was > 0
> for the percpu counter to go negative; but in this version we start out
> using the percpu counters with the atomic_t at 0, ignoring for the moment
> the bias and the initial ref.)
>
> So, the sum must be positive but the atomic_t could be negative. How
> negative?
>
> We can't do a get() to the percpu counters after we've seen that the ref
> is no longer in percpu mode - so after we've done one put to the
> atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
> we can't do a get to the percpu counter.
>
> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.
>
> So if we say arbitrarily that the maximum legal value of the refcount is
> - say - 1U << 31, then the atomic_t will always be greater than
> -((int) (1U << 31)).
>
> So as long as the total number of outstanding refs never exceeds the
> bias we're fine.

Yes. We should note the 31 bit limit somewhere. We could WARN_ON() if
count is >= BIAS in percpu_ref_kill(), perhaps.

>> I probably should have made it clearer. Sorry about that. tryget()
>> is fine. I was curious about count() as it's always a bit dangerous a
>> query interface which is racy and can return something unexpected like
>> false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Open code it there?

>> Let's just have percpu_ref_kill(ref, release) which puts the base ref
>> and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Or hand it to percpu_ref_put(), too, as per kref_put(). I hate indirect
magic.

Cheers,
Rusty.

2013-05-31 20:13:13

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Wed, May 29, 2013 at 02:29:56PM +0930, Rusty Russell wrote:
> Kent Overstreet <[email protected]> writes:
> > I'm not sure I know of any good way of explaining it intuitively, but
> > here's this at least...
> >
> > * (More precisely: because modular arithmetic is commutative the sum of all the
> > * pcpu_count vars will be equal to what it would have been if all the gets and
> > * puts were done to a single integer, even if some of the percpu integers
> > * overflow or underflow).
>
> This seems intuitively obvious, so I wouldn't sweat it too much. What
> goes up, has to come down somewhere.

I agree, but it seems there's a fair amount of disagreement over what's
intuitive :)

> Yes. We should note the 31 bit limit somewhere. We could WARN_ON() if
> count is >= BIAS in percpu_ref_kill(), perhaps.

I'd be hesitant about that - that WARN_ON() would work for this version
(I think) but it'd be incorrect for dynamic percpu refcounting, for
reasons that are almost accidental. And that WARN_ON() isn't going to
fire in anything but the most retarded torture testing.

Besides that, it's hard to imagine a situation where a range of 1 << 32
would be ok but a range of 1 << 31 wouldn't... if we need a WARN_ON()
here we need one for regular atomic_t too, but I don't see either buying
us much.

Also, if/when this is used for something where the range does matter
I'll just switch it to unsigned long (I've been debating doing that now, but
the aio code was using an atomic_t so I don't really care yet).

It should be documented though - I'll do that.

> >> I probably should have made it clearer. Sorry about that. tryget()
> >> is fine. I was curious about count() as it's always a bit dangerous a
> >> query interface which is racy and can return something unexpected like
> >> false zero or underflowed refcnt.
> >
> > Yeah, it is, it was intended just for the module code where it's only
> > used for the value lsmod shows.
>
> Open code it there?

Maybe justified for this, but I'm not a fan of open coding anything that
could be considered library/utility code... better to just document it
with ALL CAPS WARNINGS about being dangerous if used incorrectly.

But we can revisit that if/when the module refcount conversion is done.

> >> Let's just have percpu_ref_kill(ref, release) which puts the base ref
> >> and invokes release whenever it's done.
> >
> > Release has to be stored in struct percpu_ref() so it can be invoked
> > after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> > percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> > percpu_ref_init(), but yeah.
>
> Or hand it to percpu_ref_put(), too, as per kref_put(). I hate indirect
> magic.

The indirect magic is unfortunately necessary because percpu_ref_kill()
has to do a put after a call_rcu().

If the indirect magic wasn't needed I'd prefer to not pass a release
function to anything and just have percpu_ref_put() return bool, but
Tejun disagrees and it's a moot point anyways.

2013-05-31 22:52:23

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15, 2013 at 10:01:22PM +0200, Jens Axboe wrote:
> On Wed, May 15 2013, Kent Overstreet wrote:
> > It's only implemented for aio in this patch but it's actually completely
> > trivial to extend to sync kiocbs too - we can make killing a process
> > cancel outstanding sync DIOs, I just haven't gotten around to writing
> > the code. With sync kiocbs anything can use it.
>
> Oh, that wasn't even my point. It only works for iocb "backed" bios was
> my point. You would ideally like cancel for other areas as well. One
> that comes to mind is truncating files, for instance.

Sorry, I was unclear - the point was, there's nothing special about
kiocbs - if some random code (truncate related, say) wants to be able to
cancel some bios, it would just stick a kiocb somewhere (on the stack,
or wherever) and point the bios at that - the kiocb would be used for
cancellation and nothing else. All the code has to do is make sure the
kiocb can't be freed until the bios return, naturally.

If we decide struct kiocb is too big/ugly to use it this way we could
easily abstract out a "struct cancel" or something that's smaller,
though since kiocbs are already somewhat generic (see the way sync
kiocbs are used) I don't think it matters that much.

As part of the aio stuff I've been pruning struct kiocb as much as I
can, so this type of usage will make more sense and struct kiocb will be
~70 bytes instead of > 200.

> > I do hate to grow struct bio, but the aio attribute stuff I'm also
> > working on is going to need the same damn thing.
>
> If you (you being aio here) wants to support cancel, then why not just
> stuff it into bi_private?

Core block layer code (i.e. where we check if a bio/request has been
cancelled) can't depend on bi_private pointing to anything in
particular, that'd be a massive change.

_Arguably_ the right thing to do would be to, instead of having a void
bi_private pointer, have a pointer to a "struct bio_state" or somesuch -
and the owner of the bio would then embed struct bio_state into whatever
bi_private currently points to.

But that'd be a pretty massive change and I'm not sure it's the correct
approach.

> > Yeah, that's the only sane way to do it imo. If we had to do it with the
> > ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
> > the outstanding bios on a list protected by a lock so we could chase
> > down all the bios we need to cancel, and I don't even want to think
> > about stacking devices...
>
> Perfection is the enemy of good. Doing tracking across the full stack is
> just going to be insane, just don't do it...

Completely agree, was just explaining how insane it'd be :P

> > > Pretty hacky too, given that it only works for the generic case of a
> > > non-merged bio.
> >
> > More incomplete than hacky, imo - since with spinning disks you wouldn't
> > save much by cancelling one bio out of a merged request. It would make
> > sense to cancel the request if all the bios have been cancelled, but
> > wanted to start out simple and get something useful with a minimal
> > amount of code.
> >
> > Anyways, this patch is still more at the RFC stage but there is serious
> > demand for cancellation (I've seen what people are using it for, it's
> > not all crazy and the lack of it is something people are working around
> > today, painfully).
>
> I'd be willing to entertain the idea, if the implementation is low
> enough overhead and makes sense. So not completely nacking the idea, I'd
> just prefer to see something a bit more baked.

Besides adding a real request_cancelled() function, I'm not sure what
else there is to flesh out at this time - anything else I can think of
adding should IMO wait until there's real use for it.

One thing I wasn't sure about was whether blk_peek_request() was the
right place to check if the request has been cancelled - I don't know
the request queue side of things all that well. Any opinion there?

2013-06-10 23:20:36

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Wed, May 15, 2013 at 05:41:21PM +0200, Oleg Nesterov wrote:
> On 05/15, Kent Overstreet wrote:
> >
> > On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> > > tag_free() does
> > >
> > > list_del_init(wait->list);
> > > /* WINDOW */
> > > wake_up_process(wait->task);
> > >
> > > in theory the caller of tag_alloc() can notice list_empty_careful(),
> > > return without taking pool->lock, exit, and free this task_struct.
> > >
> > > But the main problem is that it is not clear why this code reimplements
> > > add_wait_queue/wake_up_all, for what?
> >
> > To save on locking... there's really no point in another lock for the
> > wait queue. Could just use the wait queue lock instead I suppose, like
> > wait_event_interruptible_locked()
>
> Yes. Or perhaps you can reuse wait_queue_head_t->lock for move_tags().
>
> And,
>
> > (the extra spin_lock()/unlock() might not really cost anything but
> > nested irqsave()/restore() is ridiculously expensive, IME).
>
> But this is the slow path anyway. Even if you do not use _locked, how
> much can this extra locking (save/restore) make things worse?
>
> In any case, I believe it would be much better to reuse the code we
> already have, to avoid the races and make the code more understandable.
> And to not bloat the code.
>
> Do you really think that, say,
>
> unsigned tag_alloc(struct tag_pool *pool, bool wait)
> {
> struct tag_cpu_freelist *tags;
> unsigned ret = 0;
> retry:
> tags = get_cpu_ptr(pool->tag_cpu);
> local_irq_disable();
> if (!tags->nr_free && pool->nr_free) {
> spin_lock(&pool->wq.lock);
> if (pool->nr_free)
> move_tags(...);
> spin_unlock(&pool->wq.lock);
> }
>
> if (tags->nr_free)
> ret = tags->free[--tags->nr_free];
> local_irq_enable();
> put_cpu_var(pool->tag_cpu);
>
> if (ret || !wait)
> return ret;
>
> __wait_event(&pool->wq, pool->nr_free);
> goto retry;
> }
>
> will be much slower?

The overhead from doing nested irqsave/restore() sucks. I've had it bite
me hard with the recent aio work. But screw it, it's not going to matter
that much here.
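
(To be concrete about what I mean by "nested" - the pattern that bit me
looks roughly like this, purely as an illustration, not code from this
series:)

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);

	static void nested_irqsave_example(void)
	{
		unsigned long outer, inner;

		local_irq_save(outer);

		/*
		 * Interrupts are already off, but spin_lock_irqsave()
		 * still pays for a second flags save/restore - that's
		 * what gets expensive in a hot path.
		 */
		spin_lock_irqsave(&example_lock, inner);
		/* ... */
		spin_unlock_irqrestore(&example_lock, inner);

		local_irq_restore(outer);
	}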

>
> > > I must admit, I do not understand what this code actually does ;)
> > > I didn't try to read it carefully though, but perhaps at least the
> > > changelog could explain more?
> >
> > The changelog is admittedly terse, but that's basically all there is to
> > it -
> > [...snip...]
>
> Yes, thanks for your explanation, I already realized what it does...
>
> Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
> Perhaps it should also take waitqueue_active() into account?
> tag_alloc() can sleep more than necessary, it seems.

No.

By "sleeping more than necessary" you mean sleeping when there's tags
available on other percpu freelists.

That's just unavoidable if the thing's to be percpu - efficient use of
available tags requires global knowledge. Sleeping less would require
more global cacheline contention, and would defeat the purpose of this
code.

So what we do is _bound_ that inefficiency - we cap the size of the
percpu freelists so that no more than half of the available tags can be
stuck on all the percpu freelists.

This means that, from the POV of work executing on one cpu, it will
always be able to use up to half the total tags (assuming they aren't
actually allocated).

So when you're deciding how many tag structs to allocate, you just
double the number you'd allocate otherwise when you're using this code.
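
The freeing side is what enforces that bound - roughly like this, where
the field names and move_tags() signature only loosely follow the
patch, so treat it as a sketch of the logic rather than the actual
code:

	/*
	 * Once a percpu freelist reaches watermark * 2, half of it is
	 * moved back to the global freelist and waiters are woken.
	 */
	static void tag_free(struct tag_pool *pool, unsigned tag)
	{
		struct tag_cpu_freelist *tags;
		unsigned long flags;

		local_irq_save(flags);
		tags = this_cpu_ptr(pool->tag_cpu);

		tags->free[tags->nr_free++] = tag;

		if (tags->nr_free == pool->watermark * 2) {
			spin_lock(&pool->wq.lock);

			/* return watermark tags to the global freelist */
			move_tags(pool->free, &pool->nr_free,
				  tags->free, &tags->nr_free,
				  pool->watermark);

			wake_up_locked(&pool->wq);
			spin_unlock(&pool->wq.lock);
		}

		local_irq_restore(flags);
	}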

2013-06-11 17:46:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 06/10, Kent Overstreet wrote:
>
> On Wed, May 15, 2013 at 05:41:21PM +0200, Oleg Nesterov wrote:
> >
> > Do you really think that, say,
> >
> > unsigned tag_alloc(struct tag_pool *pool, bool wait)
> > {
> > struct tag_cpu_freelist *tags;
> > unsigned ret = 0;
> > retry:
> > tags = get_cpu_ptr(pool->tag_cpu);
> > local_irq_disable();
> > if (!tags->nr_free && pool->nr_free) {
> > spin_lock(&pool->wq.lock);
> > if (pool->nr_free)
> > move_tags(...);
> > spin_unlock(&pool->wq.lock);
> > }
> >
> > if (tags->nr_free)
> > ret = tags->free[--tags->nr_free];
> > local_irq_enable();
> > put_cpu_var(pool->tag_cpu);
> >
> > if (ret || !wait)
> > return ret;
> >
> > __wait_event(&pool->wq, pool->nr_free);
> > goto retry;
> > }
> >
> > will be much slower?
>
> The overhead from doing nested irqsave/restore() sucks. I've had it bite
> me hard with the recent aio work.

Not sure I understand... Only __wait_event() does irqsave/restore and
we are going to sleep anyway.

> But screw it, it's not going to matter
> that much here.

Yes.

And, imho, even if we need some optimizations here, it would be better
to make a separate patch backed by numbers, or at least a detailed
explanation.

> > Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
> > Perhaps it should also take waitqueue_active() into account?
> > tag_alloc() can sleep more than necessary, it seems.
>
> No.
>
> By "sleeping more than necessary" you mean sleeping when there's tags
> available on other percpu freelists.

Yes,

> That's just unavoidable if the thing's to be percpu - efficient use of
> available tags requires global knowledge. Sleeping less would require
> more global cacheline contention, and would defeat the purpose of this
> code.

Yes, yes, I understand, there is a tradeoff. It's just still not clear
to me what would be better "in practice"... So,

> So when you're deciding how many tag structs to allocate, you just
> double the number you'd allocate otherwise when you're using this code.

I am not sure this is really needed.

But OK, I see your point, thanks.

Oleg.