2015-04-06 19:15:11

by Jeff Moyer

Subject: [PATCH 1/2] [v2] blk-mq: fix plugging in blk_sq_make_request

The following appears in blk_sq_make_request:

/*
* If we have multiple hardware queues, just go directly to
* one of those for sync IO.
*/

We clearly don't have multiple hardware queues here! This comment was
introduced in commit 07068d5b8e (blk-mq: split make request handler for
multi and single queue), whose changelog reads:

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.

The old code had this:

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

use_plug = !is_flush_fua && !is_sync;

which is not equivalent. For the single-queue case, the second half of
the old && expression is always true. So, what I think was actually
intended follows (and it more closely matches what is done in
blk_queue_bio).
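
To make the reduction explicit (worked out here for illustration, not
part of the patch): with q->nr_hw_queues == 1, the old expression
collapses like so:

	use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
	         = !is_flush_fua && (1 || !is_sync);
	         = !is_flush_fua;

In other words, on the single-queue path, plugging should be skipped
only for flush/fua requests.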

Signed-off-by: Jeff Moyer <[email protected]>

Changelog
v1->v2: removed the accidental addition of merging for flush/fua requests
---
block/blk-mq.c | 36 ++++++++++++++----------------------
1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b7b8933..cca5097 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1306,16 +1306,11 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = rw_is_sync(bio->bi_rw);
const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
- unsigned int use_plug, request_count = 0;
+ struct blk_plug *plug;
+ unsigned int request_count = 0;
struct blk_map_ctx data;
struct request *rq;

- /*
- * If we have multiple hardware queues, just go directly to
- * one of those for sync IO.
- */
- use_plug = !is_flush_fua && !is_sync;
-
blk_queue_bounce(q, &bio);

if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
@@ -1323,7 +1318,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
return;
}

- if (use_plug && !blk_queue_nomerges(q) &&
+ if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count))
return;

@@ -1342,21 +1337,18 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
* utilize that to temporarily store requests until the task is
* either done or scheduled away.
*/
- if (use_plug) {
- struct blk_plug *plug = current->plug;
-
- if (plug) {
- blk_mq_bio_to_request(rq, bio);
- if (list_empty(&plug->mq_list))
- trace_block_plug(q);
- else if (request_count >= BLK_MAX_REQUEST_COUNT) {
- blk_flush_plug_list(plug, false);
- trace_block_plug(q);
- }
- list_add_tail(&rq->queuelist, &plug->mq_list);
- blk_mq_put_ctx(data.ctx);
- return;
+ plug = current->plug;
+ if (plug) {
+ blk_mq_bio_to_request(rq, bio);
+ if (list_empty(&plug->mq_list))
+ trace_block_plug(q);
+ else if (request_count >= BLK_MAX_REQUEST_COUNT) {
+ blk_flush_plug_list(plug, false);
+ trace_block_plug(q);
}
+ list_add_tail(&rq->queuelist, &plug->mq_list);
+ blk_mq_put_ctx(data.ctx);
+ return;
}

if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
--
1.8.3.1


2015-04-06 19:15:08

by Jeff Moyer

Subject: [PATCH 2/2] blk-plug: don't flush nested plug lists

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os. This can be less than optimal (read
awful) for certain workloads. For example, consider an application
that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug's instantiated both at the upper
call site in do_io_submit and down in do_direct_IO. The latter will
submit as few as one I/O at a time (if the I/O size is small enough)
instead of performing the batching that the plugging infrastructure is
supposed to provide.
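
Schematically, the call pattern looks like this (a simplified sketch;
unrelated details elided):

	do_io_submit()
	    blk_start_plug()              <- outer plug
	    for each iocb:
	        do_direct_IO()
	            blk_start_plug()      <- nested, inner plug
	            submit a few bios
	            blk_finish_plug()     <- flushes them immediately
	    blk_finish_plug()             <- outer list is already empty

Each inner blk_finish_plug dispatches its handful of bios right away,
so nothing stays queued long enough to be merged.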

Now, for the case where there is an elevator involved, this doesn't
really matter too much. The elevator will keep the I/O around long
enough for it to be merged. However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example (note I'm using a virtio-blk device, so it's
using the blk-mq single-queue path, though I've also reproduced this
with a Micron P320h):

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges. Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test rig:

Unpatched kernel:
Read B/W: 283,638 KB/s
Read Merges: 0

Patched kernel:
Read B/W: 873,224 KB/s
Read Merges: 2,046K

I considered several approaches to solving the problem:
1) get rid of the innermost plugs
2) handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
code a bit at the expense of memory footprint

Option 1 will be tricky or impossible to do, since the innermost plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements. Option 2a is perhaps a better
idea, but since I already implemented option 2, I figured I'd post it
for comments and opinions before rewriting it.
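
Concretely, with option 2 nested plug sections just bump a depth count
on the outermost (on-stack) plug, and only the final blk_finish_plug
actually flushes:

	plug = blk_start_plug(&outer);  /* depth = 1, becomes current->plug */
	plug = blk_start_plug(&inner);  /* depth = 2, returns the outer plug */
	blk_finish_plug(plug);          /* depth = 1, nothing flushed */
	blk_finish_plug(plug);          /* depth = 0, flush, clear current->plug */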

Much of the patch involves modifying call sites to blk_start_plug,
since its signature is changed. The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h. The only tricky bits were places where plugs
were finished and then restarted to flush out I/O. There, I went
ahead and exported blk_flush_plug_list and called that directly.
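
For example, a flush-and-restart sequence like:

	blk_finish_plug(&plug);
	blk_start_plug(&plug);

is converted to a single:

	blk_flush_plug_list(plug, false);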

Comments would be greatly appreciated.

Signed-off-by: Jeff Moyer <[email protected]>
---
block/blk-core.c | 33 +++++++++++++++++++--------------
block/blk-lib.c | 6 +++---
block/blk-throttle.c | 6 +++---
drivers/block/xen-blkback/blkback.c | 6 +++---
drivers/md/dm-bufio.c | 15 +++++++--------
drivers/md/dm-kcopyd.c | 6 +++---
drivers/md/dm-thin.c | 6 +++---
drivers/md/md.c | 6 +++---
drivers/md/raid1.c | 6 +++---
drivers/md/raid10.c | 6 +++---
drivers/md/raid5.c | 12 ++++++------
drivers/target/target_core_iblock.c | 6 +++---
fs/aio.c | 6 +++---
fs/block_dev.c | 6 +++---
fs/btrfs/scrub.c | 6 +++---
fs/btrfs/transaction.c | 6 +++---
fs/btrfs/tree-log.c | 16 ++++++++--------
fs/btrfs/volumes.c | 12 +++++-------
fs/buffer.c | 6 +++---
fs/direct-io.c | 8 +++++---
fs/ext4/file.c | 6 +++---
fs/ext4/inode.c | 12 +++++-------
fs/f2fs/checkpoint.c | 6 +++---
fs/f2fs/gc.c | 6 +++---
fs/f2fs/node.c | 6 +++---
fs/jbd/checkpoint.c | 6 +++---
fs/jbd/commit.c | 10 +++++-----
fs/jbd2/checkpoint.c | 6 +++---
fs/jbd2/commit.c | 6 +++---
fs/mpage.c | 6 +++---
fs/xfs/xfs_buf.c | 12 ++++++------
fs/xfs/xfs_dir2_readdir.c | 6 +++---
fs/xfs/xfs_itable.c | 6 +++---
include/linux/blkdev.h | 3 ++-
mm/madvise.c | 6 +++---
mm/page-writeback.c | 6 +++---
mm/readahead.c | 6 +++---
mm/swap_state.c | 6 +++---
mm/vmscan.c | 6 +++---
39 files changed, 155 insertions(+), 152 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..64f3f2a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3002,7 +3002,7 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);

/**
* blk_start_plug - initialize blk_plug and track it inside the task_struct
- * @plug: The &struct blk_plug that needs to be initialized
+ * @plug: The on-stack &struct blk_plug that needs to be initialized
*
* Description:
* Tracking blk_plug inside the task_struct will help with auto-flushing the
@@ -3013,26 +3013,29 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
* page belonging to that request that is currently residing in our private
* plug. By flushing the pending I/O when the process goes to sleep, we avoid
* this kind of deadlock.
+ *
+ * Returns: a pointer to the active &struct blk_plug
*/
-void blk_start_plug(struct blk_plug *plug)
+struct blk_plug *blk_start_plug(struct blk_plug *plug)
{
struct task_struct *tsk = current;

+ if (tsk->plug) {
+ tsk->plug->depth++;
+ return tsk->plug;
+ }
+
+ plug->depth = 1;
INIT_LIST_HEAD(&plug->list);
INIT_LIST_HEAD(&plug->mq_list);
INIT_LIST_HEAD(&plug->cb_list);

/*
- * If this is a nested plug, don't actually assign it. It will be
- * flushed on its own.
+ * Store ordering should not be needed here, since a potential
+ * preempt will imply a full memory barrier
*/
- if (!tsk->plug) {
- /*
- * Store ordering should not be needed here, since a potential
- * preempt will imply a full memory barrier
- */
- tsk->plug = plug;
- }
+ tsk->plug = plug;
+ return tsk->plug;
}
EXPORT_SYMBOL(blk_start_plug);

@@ -3176,13 +3179,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)

local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(blk_flush_plug_list);

void blk_finish_plug(struct blk_plug *plug)
{
- blk_flush_plug_list(plug, false);
+ if (--plug->depth > 0)
+ return;

- if (plug == current->plug)
- current->plug = NULL;
+ blk_flush_plug_list(plug, false);
+ current->plug = NULL;
}
EXPORT_SYMBOL(blk_finish_plug);

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..e2d2448 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -48,7 +48,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
struct bio_batch bb;
struct bio *bio;
int ret = 0;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

if (!q)
return -ENXIO;
@@ -81,7 +81,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
bb.flags = 1 << BIO_UPTODATE;
bb.wait = &wait;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while (nr_sects) {
unsigned int req_sects;
sector_t end_sect, tmp;
@@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
*/
cond_resched();
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

/* Wait for bios in-flight */
if (!atomic_dec_and_test(&bb.done))
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..f57bbd3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1266,7 +1266,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
struct request_queue *q = td->queue;
struct bio_list bio_list_on_stack;
struct bio *bio;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int rw;

bio_list_init(&bio_list_on_stack);
@@ -1278,10 +1278,10 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
spin_unlock_irq(q->queue_lock);

if (!bio_list_empty(&bio_list_on_stack)) {
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while((bio = bio_list_pop(&bio_list_on_stack)))
generic_make_request(bio);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}
}

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 2a04d34..f075182 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1207,7 +1207,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
struct bio **biolist = pending_req->biolist;
int i, nbio = 0;
int operation;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
bool drain = false;
struct grant_page **pages = pending_req->segments;
unsigned short req_operation;
@@ -1368,13 +1368,13 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
}

atomic_set(&pending_req->pendcnt, nbio);
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

for (i = 0; i < nbio; i++)
submit_bio(operation, biolist[i]);

/* Let the I/Os go.. */
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

if (operation == READ)
blkif->st_rd_sect += preq.nr_sects;
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 86dbbc7..a6cbc6f 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -706,8 +706,8 @@ static void __write_dirty_buffer(struct dm_buffer *b,

static void __flush_write_list(struct list_head *write_list)
{
- struct blk_plug plug;
- blk_start_plug(&plug);
+ struct blk_plug *plug, onstack_plug;
+ plug = blk_start_plug(&onstack_plug);
while (!list_empty(write_list)) {
struct dm_buffer *b =
list_entry(write_list->next, struct dm_buffer, write_list);
@@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
submit_io(b, WRITE, b->block, write_endio);
dm_bufio_cond_resched();
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

/*
@@ -1110,13 +1110,13 @@ EXPORT_SYMBOL_GPL(dm_bufio_new);
void dm_bufio_prefetch(struct dm_bufio_client *c,
sector_t block, unsigned n_blocks)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

LIST_HEAD(write_list);

BUG_ON(dm_bufio_in_request());

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
dm_bufio_lock(c);

for (; n_blocks--; block++) {
@@ -1126,9 +1126,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
&write_list);
if (unlikely(!list_empty(&write_list))) {
dm_bufio_unlock(c);
- blk_finish_plug(&plug);
__flush_write_list(&write_list);
- blk_start_plug(&plug);
+ blk_flush_plug_list(plug, false);
dm_bufio_lock(c);
}
if (unlikely(b != NULL)) {
@@ -1149,7 +1148,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
dm_bufio_unlock(c);

flush_plug:
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}
EXPORT_SYMBOL_GPL(dm_bufio_prefetch);

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade..fed371f 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -580,7 +580,7 @@ static void do_work(struct work_struct *work)
{
struct dm_kcopyd_client *kc = container_of(work,
struct dm_kcopyd_client, kcopyd_work);
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

/*
* The order that these are called is *very* important.
@@ -589,11 +589,11 @@ static void do_work(struct work_struct *work)
* list. io jobs call wake when they complete and it all
* starts again.
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
process_jobs(&kc->complete_jobs, kc, run_complete_job);
process_jobs(&kc->pages_jobs, kc, run_pages_job);
process_jobs(&kc->io_jobs, kc, run_io_job);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

/*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..6a7459a 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1775,7 +1775,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
unsigned long flags;
struct bio *bio;
struct bio_list bios;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
unsigned count = 0;

if (tc->requeue_mode) {
@@ -1799,7 +1799,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)

spin_unlock_irqrestore(&tc->lock, flags);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while ((bio = bio_list_pop(&bios))) {
/*
* If we've got no free new_mapping structs, and processing
@@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
dm_pool_issue_prefetches(pool->pmd);
}
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

static int cmp_cells(const void *lhs, const void *rhs)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..9f24719 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7402,7 +7402,7 @@ void md_do_sync(struct md_thread *thread)
int skipped = 0;
struct md_rdev *rdev;
char *desc, *action = NULL;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

/* just incase thread restarts... */
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
@@ -7572,7 +7572,7 @@ void md_do_sync(struct md_thread *thread)
md_new_event(mddev);
update_time = jiffies;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while (j < max_sectors) {
sector_t sectors;

@@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
/*
* this also signals 'finished resyncing' to md_stop
*/
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));

/* tell personality that we are finished */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..2608bc3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2399,11 +2399,11 @@ static void raid1d(struct md_thread *thread)
unsigned long flags;
struct r1conf *conf = mddev->private;
struct list_head *head = &conf->retry_list;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

md_check_recovery(mddev);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (;;) {

flush_pending_writes(conf);
@@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
md_check_recovery(mddev);
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

static int init_resync(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c4..7bea5c7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2791,11 +2791,11 @@ static void raid10d(struct md_thread *thread)
unsigned long flags;
struct r10conf *conf = mddev->private;
struct list_head *head = &conf->retry_list;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

md_check_recovery(mddev);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (;;) {

flush_pending_writes(conf);
@@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
md_check_recovery(mddev);
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

static int init_resync(struct r10conf *conf)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..59e7090 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5259,11 +5259,11 @@ static void raid5_do_work(struct work_struct *work)
struct r5conf *conf = group->conf;
int group_id = group - conf->worker_groups;
int handled;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

pr_debug("+++ raid5worker active\n");

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
handled = 0;
spin_lock_irq(&conf->device_lock);
while (1) {
@@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
pr_debug("%d stripes handled\n", handled);

spin_unlock_irq(&conf->device_lock);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

pr_debug("--- raid5worker inactive\n");
}
@@ -5298,13 +5298,13 @@ static void raid5d(struct md_thread *thread)
struct mddev *mddev = thread->mddev;
struct r5conf *conf = mddev->private;
int handled;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

pr_debug("+++ raid5d active\n");

md_check_recovery(mddev);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
handled = 0;
spin_lock_irq(&conf->device_lock);
while (1) {
@@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
spin_unlock_irq(&conf->device_lock);

async_tx_issue_pending_all();
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

pr_debug("--- raid5d inactive\n");
}
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d4a4b0f..248a6bb 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -361,13 +361,13 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)

static void iblock_submit_bios(struct bio_list *list, int rw)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
struct bio *bio;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while ((bio = bio_list_pop(list)))
submit_bio(rw, bio);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b1c3583 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1575,7 +1575,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
struct kioctx *ctx;
long ret = 0;
int i = 0;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

if (unlikely(nr < 0))
return -EINVAL;
@@ -1592,7 +1592,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
return -EINVAL;
}

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

/*
* AKPM: should this return a partial result if some of the IOs were
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
if (ret)
break;
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

percpu_ref_put(&ctx->users);
return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..928d3e0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1598,10 +1598,10 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
ssize_t ret;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
ret = __generic_file_write_iter(iocb, from);
if (ret > 0) {
ssize_t err;
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err < 0)
ret = err;
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
return ret;
}
EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..768bac3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2967,7 +2967,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
struct btrfs_root *root = fs_info->extent_root;
struct btrfs_root *csum_root = fs_info->csum_root;
struct btrfs_extent_item *extent;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
u64 flags;
int ret;
int slot;
@@ -3088,7 +3088,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
* collect all data csums for the stripe to avoid seeking during
* the scrub. This might currently (crc32) end up to be about 1MB
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

/*
* now find all extents for each stripe and scrub them
@@ -3316,7 +3316,7 @@ out:
scrub_wr_submit(sctx);
mutex_unlock(&sctx->wr_ctx.wr_lock);

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
btrfs_free_path(path);
btrfs_free_path(ppath);
return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..b5a8078 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -979,11 +979,11 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
{
int ret;
int ret2;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
ret = btrfs_write_marked_extents(root, dirty_pages, mark);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);

if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..1623924 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2519,7 +2519,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
int log_transid = 0;
struct btrfs_log_ctx root_log_ctx;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

mutex_lock(&root->log_mutex);
log_transid = ctx->log_transid;
@@ -2571,10 +2571,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
/* we start IO on all the marked extents here, but we don't actually
* wait for them until later.
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
if (ret) {
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
btrfs_abort_transaction(trans, root, ret);
btrfs_free_logged_extents(log, log_transid);
btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
if (!list_empty(&root_log_ctx.list))
list_del_init(&root_log_ctx.list);

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
btrfs_set_log_full_commit(root->fs_info, trans);

if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
}

if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
mutex_unlock(&log_root_tree->log_mutex);
ret = root_log_ctx.log_ret;
goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,

index2 = root_log_ctx.log_transid % 2;
if (atomic_read(&log_root_tree->log_commit[index2])) {
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
mark);
btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
* check the full commit flag again
*/
if (btrfs_need_log_full_commit(root->fs_info, trans)) {
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
btrfs_free_logged_extents(log, log_transid);
mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
ret = btrfs_write_marked_extents(log_root_tree,
&log_root_tree->dirty_log_pages,
EXTENT_DIRTY | EXTENT_NEW);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
if (ret) {
btrfs_set_log_full_commit(root->fs_info, trans);
btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..0e215ff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -261,7 +261,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
unsigned long last_waited = 0;
int force_reg = 0;
int sync_pending = 0;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

/*
* this function runs all the bios we've collected for
@@ -269,7 +269,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
* another device without first sending all of these down.
* So, setup a plug here and finish it off before we return
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

bdi = blk_get_backing_dev_info(device->bdev);
fs_info = device->dev_root->fs_info;
@@ -358,8 +358,7 @@ loop_lock:
if (pending_bios == &device->pending_sync_bios) {
sync_pending = 1;
} else if (sync_pending) {
- blk_finish_plug(&plug);
- blk_start_plug(&plug);
+ blk_flush_plug_list(plug, false);
sync_pending = 0;
}

@@ -415,8 +414,7 @@ loop_lock:
}
/* unplug every 64 requests just for good measure */
if (batch_run % 64 == 0) {
- blk_finish_plug(&plug);
- blk_start_plug(&plug);
+ blk_flush_plug_list(plug, false);
sync_pending = 0;
}
}
@@ -431,7 +429,7 @@ loop_lock:
spin_unlock(&device->io_lock);

done:
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..727d642 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -717,10 +717,10 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
struct list_head tmp;
struct address_space *mapping;
int err = 0, err2;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

INIT_LIST_HEAD(&tmp);
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

spin_lock(lock);
while (!list_empty(list)) {
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
}

spin_unlock(lock);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
spin_lock(lock);

while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e79c0c6 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -488,6 +488,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
static void dio_await_completion(struct dio *dio)
{
struct bio *bio;
+
+ blk_flush_plug(current);
do {
bio = dio_await_one(dio);
if (bio)
@@ -1108,7 +1110,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct dio *dio;
struct dio_submit sdio = { 0, };
struct buffer_head map_bh = { 0, };
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
unsigned long align = offset | iov_iter_alignment(iter);

if (rw & WRITE)
@@ -1231,7 +1233,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,

sdio.pages_in_io += iov_iter_npages(iter, INT_MAX);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

retval = do_direct_IO(dio, &sdio, &map_bh);
if (retval)
@@ -1262,7 +1264,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
if (sdio.bio)
dio_bio_submit(dio, &sdio);

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

/*
* It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..fd0cf21 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(iocb->ki_filp);
struct mutex *aio_mutex = NULL;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int o_direct = io_is_direct(file);
int overwrite = 0;
size_t length = iov_iter_count(from);
@@ -139,7 +139,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

iocb->private = &overwrite;
if (o_direct) {
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);


/* check whether we do a DIO overwrite or not */
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = err;
}
if (o_direct)
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

errout:
if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..d4b645b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2284,7 +2284,7 @@ static int ext4_writepages(struct address_space *mapping,
int needed_blocks, rsv_blocks = 0, ret = 0;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
bool done;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
bool give_up_on_write = false;

trace_ext4_writepages(inode, wbc);
@@ -2298,11 +2298,9 @@ static int ext4_writepages(struct address_space *mapping,
goto out_writepages;

if (ext4_should_journal_data(inode)) {
- struct blk_plug plug;
-
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
ret = write_cache_pages(mapping, wbc, __writepage, mapping);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
goto out_writepages;
}

@@ -2368,7 +2366,7 @@ retry:
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
done = false;
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while (!done && mpd.first_page <= mpd.last_page) {
/* For each extent of pages we use new io_end */
mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
@@ -2438,7 +2436,7 @@ retry:
if (ret)
break;
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
if (!ret && !cycled && wbc->nr_to_write > 0) {
cycled = 1;
mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..de2b522 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -810,10 +810,10 @@ static int block_operations(struct f2fs_sb_info *sbi)
.nr_to_write = LONG_MAX,
.for_reclaim = 0,
};
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int err = 0;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

retry_flush_dents:
f2fs_lock_all(sbi);
@@ -846,7 +846,7 @@ retry_flush_nodes:
goto retry_flush_nodes;
}
out:
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
return err;
}

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..082e961 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -661,12 +661,12 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
{
struct page *sum_page;
struct f2fs_summary_block *sum;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

/* read segment summary of victim */
sum_page = get_sum_page(sbi, segno);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

sum = page_address(sum_page);

@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
break;
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..e1fc81e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1066,7 +1066,7 @@ repeat:
struct page *get_node_page_ra(struct page *parent, int start)
{
struct f2fs_sb_info *sbi = F2FS_P_SB(parent);
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
struct page *page;
int err, i, end;
nid_t nid;
@@ -1086,7 +1086,7 @@ repeat:
else if (err == LOCKED_PAGE)
goto page_hit;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

/* Then, try readahead for siblings of the desired node */
end = start + MAX_RA_NODE;
@@ -1098,7 +1098,7 @@ repeat:
ra_node_page(sbi, nid);
}

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

lock_page(page);
if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..29b550d 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -258,12 +258,12 @@ static void
__flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
{
int i;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (i = 0; i < *batch_count; i++)
write_dirty_buffer(bhs[i], WRITE_SYNC);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

for (i = 0; i < *batch_count; i++) {
struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..ccb32f0 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -311,7 +311,7 @@ void journal_commit_transaction(journal_t *journal)
int first_tag = 0;
int tag_flag;
int i;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int write_op = WRITE;

/*
@@ -444,10 +444,10 @@ void journal_commit_transaction(journal_t *journal)
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
err = journal_submit_data_buffers(journal, commit_transaction,
write_op);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

/*
* Wait for all previously submitted IO to complete.
@@ -503,7 +503,7 @@ void journal_commit_transaction(journal_t *journal)
err = 0;
}

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

journal_write_revoke_records(journal, commit_transaction, write_op);

@@ -697,7 +697,7 @@ start_journal_io:
}
}

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..31b5e73 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -182,12 +182,12 @@ static void
__flush_batch(journal_t *journal, int *batch_count)
{
int i;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (i = 0; i < *batch_count; i++)
write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

for (i = 0; i < *batch_count; i++) {
struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..713d26a 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -390,7 +390,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int tag_bytes = journal_tag_bytes(journal);
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
/* Tail of the journal */
unsigned long first_block;
tid_t first_tid;
@@ -555,7 +555,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
if (err)
jbd2_journal_abort(journal, err);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
jbd2_journal_write_revoke_records(journal, commit_transaction,
&log_bufs, WRITE_SYNC);

@@ -805,7 +805,7 @@ start_journal_io:
__jbd2_journal_abort_hard(journal);
}

- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..926dc42 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -676,10 +676,10 @@ int
mpage_writepages(struct address_space *mapping,
struct writeback_control *wbc, get_block_t get_block)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int ret;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

if (!get_block)
ret = generic_writepages(mapping, wbc);
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
if (mpd.bio)
mpage_bio_submit(WRITE, mpd.bio);
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
return ret;
}
EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..776ac5a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1208,7 +1208,7 @@ STATIC void
_xfs_buf_ioapply(
struct xfs_buf *bp)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int rw;
int offset;
int size;
@@ -1281,7 +1281,7 @@ _xfs_buf_ioapply(
*/
offset = bp->b_offset;
size = BBTOB(bp->b_io_length);
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (i = 0; i < bp->b_map_count; i++) {
xfs_buf_ioapply_map(bp, i, &offset, &size, rw);
if (bp->b_error)
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
if (size <= 0)
break; /* all done */
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

/*
@@ -1772,7 +1772,7 @@ __xfs_buf_delwri_submit(
struct list_head *io_list,
bool wait)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
struct xfs_buf *bp, *n;
int pinned = 0;

@@ -1806,7 +1806,7 @@ __xfs_buf_delwri_submit(

list_sort(NULL, io_list, xfs_buf_cmp);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
list_for_each_entry_safe(bp, n, io_list, b_list) {
bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
bp->b_flags |= XBF_WRITE | XBF_ASYNC;
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(

xfs_buf_submit(bp);
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

return pinned;
}
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..6074671 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -275,7 +275,7 @@ xfs_dir2_leaf_readbuf(
struct xfs_inode *dp = args->dp;
struct xfs_buf *bp = *bpp;
struct xfs_bmbt_irec *map = mip->map;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int error = 0;
int length;
int i;
@@ -404,7 +404,7 @@ xfs_dir2_leaf_readbuf(
/*
* Do we need more readahead?
*/
- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (mip->ra_index = mip->ra_offset = i = 0;
mip->ra_want > mip->ra_current && i < mip->map_blocks;
i += geo->fsbcount) {
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
}
}
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

out:
*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..b890aa4 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -179,7 +179,7 @@ xfs_bulkstat_ichunk_ra(
struct xfs_inobt_rec_incore *irec)
{
xfs_agblock_t agbno;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int blks_per_cluster;
int inodes_per_cluster;
int i; /* inode chunk index */
@@ -188,7 +188,7 @@ xfs_bulkstat_ichunk_ra(
blks_per_cluster = xfs_icluster_size_fsb(mp);
inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (i = 0; i < XFS_INODES_PER_CHUNK;
i += inodes_per_cluster, agbno += blks_per_cluster) {
if (xfs_inobt_maskn(i, inodes_per_cluster) & ~irec->ir_free) {
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
&xfs_inode_buf_ops);
}
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
}

/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..4617fce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
* schedule() where blk_schedule_flush_plug() is called.
*/
struct blk_plug {
+ int depth; /* number of nested plugs */
struct list_head list; /* requests */
struct list_head mq_list; /* blk-mq requests */
struct list_head cb_list; /* md requires an unplug callback */
@@ -1106,7 +1107,7 @@ struct blk_plug_cb {
};
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
void *data, int size);
-extern void blk_start_plug(struct blk_plug *);
+extern struct blk_plug *blk_start_plug(struct blk_plug *);
extern void blk_finish_plug(struct blk_plug *);
extern void blk_flush_plug_list(struct blk_plug *, bool);

diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..1f7d5ad 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -463,7 +463,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
int error = -EINVAL;
int write;
size_t len;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
@@ -503,7 +503,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
if (vma && start > vma->vm_start)
prev = vma;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (;;) {
/* Still start < end. */
error = -ENOMEM;
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
vma = find_vma(current->mm, start);
}
out:
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
if (write)
up_write(&current->mm->mmap_sem);
else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..9369d5e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2011,16 +2011,16 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
int generic_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
int ret;

/* deal with chardevs and other special file */
if (!mapping->a_ops->writepage)
return 0;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
ret = write_cache_pages(mapping, wbc, __writepage, mapping);
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
return ret;
}

diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..c3350b5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -111,11 +111,11 @@ EXPORT_SYMBOL(read_cache_pages);
static int read_pages(struct address_space *mapping, struct file *filp,
struct list_head *pages, unsigned nr_pages)
{
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
unsigned page_idx;
int ret;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);

if (mapping->a_ops->readpages) {
ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
ret = 0;

out:
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

return ret;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..3f6b8ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -455,7 +455,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;

mask = swapin_nr_pages(offset) - 1;
if (!mask)
@@ -467,7 +467,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
SetPageReadahead(page);
page_cache_release(page);
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);

lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..fd29974 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2130,7 +2130,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct blk_plug plug;
+ struct blk_plug *plug, onstack_plug;
bool scan_adjusted;

get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
@@ -2152,7 +2152,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);

- blk_start_plug(&plug);
+ plug = blk_start_plug(&onstack_plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
unsigned long nr_anon, nr_file, percentage;
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

scan_adjusted = true;
}
- blk_finish_plug(&plug);
+ blk_finish_plug(plug);
sc->nr_reclaimed += nr_reclaimed;

/*
--
1.8.3.1

2015-04-07 09:19:41

by Ming Lei

Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

Hi Jeff,

On Tue, Apr 7, 2015 at 3:14 AM, Jeff Moyer <[email protected]> wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os. This can be less than optimal (read
> awful) for certain workloads. For example, consider an application
> that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug's instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO. The latter will
> submit as few as one I/O at a time (if the I/O size is small enough)
> instead of performing the batching that the plugging infrastructure is
> supposed to provide.
>
> Now, for the case where there is an elevator involved, this doesn't
> really matter too much. The elevator will keep the I/O around long
> enough for it to be merged. However, in cases where there is no
> elevator (like blk-mq), I/Os are simply dispatched immediately.
>
> Try this, for example (note I'm using a virtio-blk device, so it's
> using the blk-mq single-queue path, though I've also reproduced this
> with a Micron P320h):
>
> fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
>
> If you run that on a current kernel, you will get zero merges. Zero!
> After this patch, you will get many merges (the actual number depends
> on how fast your storage is, obviously), and much better throughput.
> Here are results from my test rig:
>
> Unpatched kernel:
> Read B/W: 283,638 KB/s
> Read Merges: 0
>
> Patched kernel:
> Read B/W: 873,224 KB/s
> Read Merges: 2,046K

The data is amazing, but it might be better to provide some latency
numbers as well.

>
> I considered several approaches to solving the problem:
> 1) get rid of the innermost plugs
> 2) handle nesting by using only one on-stack plug
> 2a) #2, except use a per-cpu blk_plug struct, which may clean up the
> code a bit at the expense of memory footprint
>
> Option 1 will be tricky or impossible to do, since the innermost plug
> lists are sometimes the only plug lists, depending on the call path.
> Option 2 is what this patch implements. Option 2a is perhaps a better
> idea, but since I already implemented option 2, I figured I'd post it
> for comments and opinions before rewriting it.
>
> Much of the patch involves modifying call sites to blk_start_plug,
> since its signature is changed. The meat of the patch is actually

I am wondering whether the signature of blk_start_plug() really has to
change, since with your patch the active plug is always the top-level
plug, and blk_finish_plug() can find the active plug from
current->plug.
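
Something like the following (untested, just to illustrate the idea)
would avoid touching every call site:

	void blk_finish_plug(struct blk_plug *plug)
	{
		/* operate on the top-level plug, whatever was passed in */
		plug = current->plug;

		if (--plug->depth > 0)
			return;

		blk_flush_plug_list(plug, false);
		current->plug = NULL;
	}

and blk_start_plug() could keep returning void.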

> pretty simple and constrained to block/blk-core.c and
> include/linux/blkdev.h. The only tricky bits were places where plugs
> were finished and then restarted to flush out I/O. There, I went
> ahead and exported blk_flush_plug_list and called that directly.
>
> Comments would be greatly appreciated.
>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> block/blk-core.c | 33 +++++++++++++++++++--------------
> block/blk-lib.c | 6 +++---
> block/blk-throttle.c | 6 +++---
> drivers/block/xen-blkback/blkback.c | 6 +++---
> drivers/md/dm-bufio.c | 15 +++++++--------
> drivers/md/dm-kcopyd.c | 6 +++---
> drivers/md/dm-thin.c | 6 +++---
> drivers/md/md.c | 6 +++---
> drivers/md/raid1.c | 6 +++---
> drivers/md/raid10.c | 6 +++---
> drivers/md/raid5.c | 12 ++++++------
> drivers/target/target_core_iblock.c | 6 +++---
> fs/aio.c | 6 +++---
> fs/block_dev.c | 6 +++---
> fs/btrfs/scrub.c | 6 +++---
> fs/btrfs/transaction.c | 6 +++---
> fs/btrfs/tree-log.c | 16 ++++++++--------
> fs/btrfs/volumes.c | 12 +++++-------
> fs/buffer.c | 6 +++---
> fs/direct-io.c | 8 +++++---
> fs/ext4/file.c | 6 +++---
> fs/ext4/inode.c | 12 +++++-------
> fs/f2fs/checkpoint.c | 6 +++---
> fs/f2fs/gc.c | 6 +++---
> fs/f2fs/node.c | 6 +++---
> fs/jbd/checkpoint.c | 6 +++---
> fs/jbd/commit.c | 10 +++++-----
> fs/jbd2/checkpoint.c | 6 +++---
> fs/jbd2/commit.c | 6 +++---
> fs/mpage.c | 6 +++---
> fs/xfs/xfs_buf.c | 12 ++++++------
> fs/xfs/xfs_dir2_readdir.c | 6 +++---
> fs/xfs/xfs_itable.c | 6 +++---
> include/linux/blkdev.h | 3 ++-
> mm/madvise.c | 6 +++---
> mm/page-writeback.c | 6 +++---
> mm/readahead.c | 6 +++---
> mm/swap_state.c | 6 +++---
> mm/vmscan.c | 6 +++---
> 39 files changed, 155 insertions(+), 152 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..64f3f2a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3002,7 +3002,7 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
>
> /**
> * blk_start_plug - initialize blk_plug and track it inside the task_struct
> - * @plug: The &struct blk_plug that needs to be initialized
> + * @plug: The on-stack &struct blk_plug that needs to be initialized
> *
> * Description:
> * Tracking blk_plug inside the task_struct will help with auto-flushing the
> @@ -3013,26 +3013,29 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
> * page belonging to that request that is currently residing in our private
> * plug. By flushing the pending I/O when the process goes to sleep, we avoid
> * this kind of deadlock.
> + *
> + * Returns: a pointer to the active &struct blk_plug
> */
> -void blk_start_plug(struct blk_plug *plug)
> +struct blk_plug *blk_start_plug(struct blk_plug *plug)

The change might not be necessary.

> {
> struct task_struct *tsk = current;
>
> + if (tsk->plug) {
> + tsk->plug->depth++;
> + return tsk->plug;
> + }
> +
> + plug->depth = 1;
> INIT_LIST_HEAD(&plug->list);
> INIT_LIST_HEAD(&plug->mq_list);
> INIT_LIST_HEAD(&plug->cb_list);
>
> /*
> - * If this is a nested plug, don't actually assign it. It will be
> - * flushed on its own.
> + * Store ordering should not be needed here, since a potential
> + * preempt will imply a full memory barrier
> */
> - if (!tsk->plug) {
> - /*
> - * Store ordering should not be needed here, since a potential
> - * preempt will imply a full memory barrier
> - */
> - tsk->plug = plug;
> - }
> + tsk->plug = plug;
> + return tsk->plug;

tsk->plug is always returned from this function, which means tsk->plug
is always the active plug.

> }
> EXPORT_SYMBOL(blk_start_plug);
>
> @@ -3176,13 +3179,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>
> local_irq_restore(flags);
> }
> +EXPORT_SYMBOL_GPL(blk_flush_plug_list);
>
> void blk_finish_plug(struct blk_plug *plug)
> {
> - blk_flush_plug_list(plug, false);
> + if (--plug->depth > 0)
> + return;

The active plug should be current->plug, assuming blk_finish_plug()
is always paired with blk_start_plug().

>
> - if (plug == current->plug)
> - current->plug = NULL;
> + blk_flush_plug_list(plug, false);
> + current->plug = NULL;
> }
> EXPORT_SYMBOL(blk_finish_plug);
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..e2d2448 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -48,7 +48,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> struct bio_batch bb;
> struct bio *bio;
> int ret = 0;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> if (!q)
> return -ENXIO;
> @@ -81,7 +81,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> bb.flags = 1 << BIO_UPTODATE;
> bb.wait = &wait;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while (nr_sects) {
> unsigned int req_sects;
> sector_t end_sect, tmp;
> @@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> */
> cond_resched();
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);

As in the comment above, blk_finish_plug() can figure out the active
plug on its own, so it looks unnecessary to introduce a return value
for blk_start_plug().

>
> /* Wait for bios in-flight */
> if (!atomic_dec_and_test(&bb.done))
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 5b9c6d5..f57bbd3 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1266,7 +1266,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
> struct request_queue *q = td->queue;
> struct bio_list bio_list_on_stack;
> struct bio *bio;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int rw;
>
> bio_list_init(&bio_list_on_stack);
> @@ -1278,10 +1278,10 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
> spin_unlock_irq(q->queue_lock);
>
> if (!bio_list_empty(&bio_list_on_stack)) {
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while((bio = bio_list_pop(&bio_list_on_stack)))
> generic_make_request(bio);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
> }
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 2a04d34..f075182 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1207,7 +1207,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> struct bio **biolist = pending_req->biolist;
> int i, nbio = 0;
> int operation;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> bool drain = false;
> struct grant_page **pages = pending_req->segments;
> unsigned short req_operation;
> @@ -1368,13 +1368,13 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> }
>
> atomic_set(&pending_req->pendcnt, nbio);
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> for (i = 0; i < nbio; i++)
> submit_bio(operation, biolist[i]);
>
> /* Let the I/Os go.. */
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> if (operation == READ)
> blkif->st_rd_sect += preq.nr_sects;
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 86dbbc7..a6cbc6f 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -706,8 +706,8 @@ static void __write_dirty_buffer(struct dm_buffer *b,
>
> static void __flush_write_list(struct list_head *write_list)
> {
> - struct blk_plug plug;
> - blk_start_plug(&plug);
> + struct blk_plug *plug, onstack_plug;
> + plug = blk_start_plug(&onstack_plug);
> while (!list_empty(write_list)) {
> struct dm_buffer *b =
> list_entry(write_list->next, struct dm_buffer, write_list);
> @@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
> submit_io(b, WRITE, b->block, write_endio);
> dm_bufio_cond_resched();
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> /*
> @@ -1110,13 +1110,13 @@ EXPORT_SYMBOL_GPL(dm_bufio_new);
> void dm_bufio_prefetch(struct dm_bufio_client *c,
> sector_t block, unsigned n_blocks)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> LIST_HEAD(write_list);
>
> BUG_ON(dm_bufio_in_request());
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> dm_bufio_lock(c);
>
> for (; n_blocks--; block++) {
> @@ -1126,9 +1126,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
> &write_list);
> if (unlikely(!list_empty(&write_list))) {
> dm_bufio_unlock(c);
> - blk_finish_plug(&plug);
> __flush_write_list(&write_list);
> - blk_start_plug(&plug);
> + blk_flush_plug_list(plug, false);
> dm_bufio_lock(c);
> }
> if (unlikely(b != NULL)) {
> @@ -1149,7 +1148,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
> dm_bufio_unlock(c);
>
> flush_plug:
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
> EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
>
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade..fed371f 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -580,7 +580,7 @@ static void do_work(struct work_struct *work)
> {
> struct dm_kcopyd_client *kc = container_of(work,
> struct dm_kcopyd_client, kcopyd_work);
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> /*
> * The order that these are called is *very* important.
> @@ -589,11 +589,11 @@ static void do_work(struct work_struct *work)
> * list. io jobs call wake when they complete and it all
> * starts again.
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> process_jobs(&kc->complete_jobs, kc, run_complete_job);
> process_jobs(&kc->pages_jobs, kc, run_pages_job);
> process_jobs(&kc->io_jobs, kc, run_io_job);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> /*
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..6a7459a 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1775,7 +1775,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> unsigned long flags;
> struct bio *bio;
> struct bio_list bios;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> unsigned count = 0;
>
> if (tc->requeue_mode) {
> @@ -1799,7 +1799,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>
> spin_unlock_irqrestore(&tc->lock, flags);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while ((bio = bio_list_pop(&bios))) {
> /*
> * If we've got no free new_mapping structs, and processing
> @@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
> dm_pool_issue_prefetches(pool->pmd);
> }
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> static int cmp_cells(const void *lhs, const void *rhs)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 717daad..9f24719 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7402,7 +7402,7 @@ void md_do_sync(struct md_thread *thread)
> int skipped = 0;
> struct md_rdev *rdev;
> char *desc, *action = NULL;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> /* just incase thread restarts... */
> if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
> @@ -7572,7 +7572,7 @@ void md_do_sync(struct md_thread *thread)
> md_new_event(mddev);
> update_time = jiffies;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while (j < max_sectors) {
> sector_t sectors;
>
> @@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
> /*
> * this also signals 'finished resyncing' to md_stop
> */
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>
> /* tell personality that we are finished */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d34e238..2608bc3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2399,11 +2399,11 @@ static void raid1d(struct md_thread *thread)
> unsigned long flags;
> struct r1conf *conf = mddev->private;
> struct list_head *head = &conf->retry_list;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> md_check_recovery(mddev);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (;;) {
>
> flush_pending_writes(conf);
> @@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
> if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
> md_check_recovery(mddev);
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> static int init_resync(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..7bea5c7 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2791,11 +2791,11 @@ static void raid10d(struct md_thread *thread)
> unsigned long flags;
> struct r10conf *conf = mddev->private;
> struct list_head *head = &conf->retry_list;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> md_check_recovery(mddev);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (;;) {
>
> flush_pending_writes(conf);
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
> if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
> md_check_recovery(mddev);
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..59e7090 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5259,11 +5259,11 @@ static void raid5_do_work(struct work_struct *work)
> struct r5conf *conf = group->conf;
> int group_id = group - conf->worker_groups;
> int handled;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> pr_debug("+++ raid5worker active\n");
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> handled = 0;
> spin_lock_irq(&conf->device_lock);
> while (1) {
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
> pr_debug("%d stripes handled\n", handled);
>
> spin_unlock_irq(&conf->device_lock);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> pr_debug("--- raid5worker inactive\n");
> }
> @@ -5298,13 +5298,13 @@ static void raid5d(struct md_thread *thread)
> struct mddev *mddev = thread->mddev;
> struct r5conf *conf = mddev->private;
> int handled;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> pr_debug("+++ raid5d active\n");
>
> md_check_recovery(mddev);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> handled = 0;
> spin_lock_irq(&conf->device_lock);
> while (1) {
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
> spin_unlock_irq(&conf->device_lock);
>
> async_tx_issue_pending_all();
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> pr_debug("--- raid5d inactive\n");
> }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..248a6bb 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -361,13 +361,13 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
>
> static void iblock_submit_bios(struct bio_list *list, int rw)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> struct bio *bio;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while ((bio = bio_list_pop(list)))
> submit_bio(rw, bio);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b1c3583 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1575,7 +1575,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> struct kioctx *ctx;
> long ret = 0;
> int i = 0;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> if (unlikely(nr < 0))
> return -EINVAL;
> @@ -1592,7 +1592,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> return -EINVAL;
> }
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> /*
> * AKPM: should this return a partial result if some of the IOs were
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> if (ret)
> break;
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> percpu_ref_put(&ctx->users);
> return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..928d3e0 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1598,10 +1598,10 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct file *file = iocb->ki_filp;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> ssize_t ret;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> ret = __generic_file_write_iter(iocb, from);
> if (ret > 0) {
> ssize_t err;
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (err < 0)
> ret = err;
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> return ret;
> }
> EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..768bac3 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2967,7 +2967,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
> struct btrfs_root *root = fs_info->extent_root;
> struct btrfs_root *csum_root = fs_info->csum_root;
> struct btrfs_extent_item *extent;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> u64 flags;
> int ret;
> int slot;
> @@ -3088,7 +3088,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
> * collect all data csums for the stripe to avoid seeking during
> * the scrub. This might currently (crc32) end up to be about 1MB
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> /*
> * now find all extents for each stripe and scrub them
> @@ -3316,7 +3316,7 @@ out:
> scrub_wr_submit(sctx);
> mutex_unlock(&sctx->wr_ctx.wr_lock);
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> btrfs_free_path(path);
> btrfs_free_path(ppath);
> return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..b5a8078 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -979,11 +979,11 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
> {
> int ret;
> int ret2;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>
> if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..1623924 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2519,7 +2519,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
> int log_transid = 0;
> struct btrfs_log_ctx root_log_ctx;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> mutex_lock(&root->log_mutex);
> log_transid = ctx->log_transid;
> @@ -2571,10 +2571,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> /* we start IO on all the marked extents here, but we don't actually
> * wait for them until later.
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
> if (ret) {
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> btrfs_abort_transaction(trans, root, ret);
> btrfs_free_logged_extents(log, log_transid);
> btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> if (!list_empty(&root_log_ctx.list))
> list_del_init(&root_log_ctx.list);
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> btrfs_set_log_full_commit(root->fs_info, trans);
>
> if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> }
>
> if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> mutex_unlock(&log_root_tree->log_mutex);
> ret = root_log_ctx.log_ret;
> goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>
> index2 = root_log_ctx.log_transid % 2;
> if (atomic_read(&log_root_tree->log_commit[index2])) {
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
> mark);
> btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> * check the full commit flag again
> */
> if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> btrfs_free_logged_extents(log, log_transid);
> mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> ret = btrfs_write_marked_extents(log_root_tree,
> &log_root_tree->dirty_log_pages,
> EXTENT_DIRTY | EXTENT_NEW);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> if (ret) {
> btrfs_set_log_full_commit(root->fs_info, trans);
> btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..0e215ff 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -261,7 +261,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
> unsigned long last_waited = 0;
> int force_reg = 0;
> int sync_pending = 0;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> /*
> * this function runs all the bios we've collected for
> @@ -269,7 +269,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
> * another device without first sending all of these down.
> * So, setup a plug here and finish it off before we return
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> bdi = blk_get_backing_dev_info(device->bdev);
> fs_info = device->dev_root->fs_info;
> @@ -358,8 +358,7 @@ loop_lock:
> if (pending_bios == &device->pending_sync_bios) {
> sync_pending = 1;
> } else if (sync_pending) {
> - blk_finish_plug(&plug);
> - blk_start_plug(&plug);
> + blk_flush_plug_list(plug, false);
> sync_pending = 0;
> }
>
> @@ -415,8 +414,7 @@ loop_lock:
> }
> /* unplug every 64 requests just for good measure */
> if (batch_run % 64 == 0) {
> - blk_finish_plug(&plug);
> - blk_start_plug(&plug);
> + blk_flush_plug_list(plug, false);
> sync_pending = 0;
> }
> }
> @@ -431,7 +429,7 @@ loop_lock:
> spin_unlock(&device->io_lock);
>
> done:
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..727d642 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -717,10 +717,10 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
> struct list_head tmp;
> struct address_space *mapping;
> int err = 0, err2;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> INIT_LIST_HEAD(&tmp);
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> spin_lock(lock);
> while (!list_empty(list)) {
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
> }
>
> spin_unlock(lock);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> spin_lock(lock);
>
> while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..e79c0c6 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -488,6 +488,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
> static void dio_await_completion(struct dio *dio)
> {
> struct bio *bio;
> +
> + blk_flush_plug(current);
> do {
> bio = dio_await_one(dio);
> if (bio)
> @@ -1108,7 +1110,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> struct dio *dio;
> struct dio_submit sdio = { 0, };
> struct buffer_head map_bh = { 0, };
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> unsigned long align = offset | iov_iter_alignment(iter);
>
> if (rw & WRITE)
> @@ -1231,7 +1233,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>
> sdio.pages_in_io += iov_iter_npages(iter, INT_MAX);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> retval = do_direct_IO(dio, &sdio, &map_bh);
> if (retval)
> @@ -1262,7 +1264,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> if (sdio.bio)
> dio_bio_submit(dio, &sdio);
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> /*
> * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..fd0cf21 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> struct file *file = iocb->ki_filp;
> struct inode *inode = file_inode(iocb->ki_filp);
> struct mutex *aio_mutex = NULL;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int o_direct = io_is_direct(file);
> int overwrite = 0;
> size_t length = iov_iter_count(from);
> @@ -139,7 +139,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> iocb->private = &overwrite;
> if (o_direct) {
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
>
> /* check whether we do a DIO overwrite or not */
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ret = err;
> }
> if (o_direct)
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> errout:
> if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..d4b645b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2284,7 +2284,7 @@ static int ext4_writepages(struct address_space *mapping,
> int needed_blocks, rsv_blocks = 0, ret = 0;
> struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
> bool done;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> bool give_up_on_write = false;
>
> trace_ext4_writepages(inode, wbc);
> @@ -2298,11 +2298,9 @@ static int ext4_writepages(struct address_space *mapping,
> goto out_writepages;
>
> if (ext4_should_journal_data(inode)) {
> - struct blk_plug plug;
> -
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> goto out_writepages;
> }
>
> @@ -2368,7 +2366,7 @@ retry:
> if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
> done = false;
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while (!done && mpd.first_page <= mpd.last_page) {
> /* For each extent of pages we use new io_end */
> mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
> @@ -2438,7 +2436,7 @@ retry:
> if (ret)
> break;
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> if (!ret && !cycled && wbc->nr_to_write > 0) {
> cycled = 1;
> mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..de2b522 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -810,10 +810,10 @@ static int block_operations(struct f2fs_sb_info *sbi)
> .nr_to_write = LONG_MAX,
> .for_reclaim = 0,
> };
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int err = 0;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> retry_flush_dents:
> f2fs_lock_all(sbi);
> @@ -846,7 +846,7 @@ retry_flush_nodes:
> goto retry_flush_nodes;
> }
> out:
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> return err;
> }
>
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..082e961 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -661,12 +661,12 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
> {
> struct page *sum_page;
> struct f2fs_summary_block *sum;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> /* read segment summary of victim */
> sum_page = get_sum_page(sbi, segno);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> sum = page_address(sum_page);
>
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
> gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
> break;
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
> stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..e1fc81e 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1066,7 +1066,7 @@ repeat:
> struct page *get_node_page_ra(struct page *parent, int start)
> {
> struct f2fs_sb_info *sbi = F2FS_P_SB(parent);
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> struct page *page;
> int err, i, end;
> nid_t nid;
> @@ -1086,7 +1086,7 @@ repeat:
> else if (err == LOCKED_PAGE)
> goto page_hit;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> /* Then, try readahead for siblings of the desired node */
> end = start + MAX_RA_NODE;
> @@ -1098,7 +1098,7 @@ repeat:
> ra_node_page(sbi, nid);
> }
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> lock_page(page);
> if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..29b550d 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -258,12 +258,12 @@ static void
> __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
> {
> int i;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (i = 0; i < *batch_count; i++)
> write_dirty_buffer(bhs[i], WRITE_SYNC);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> for (i = 0; i < *batch_count; i++) {
> struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..ccb32f0 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -311,7 +311,7 @@ void journal_commit_transaction(journal_t *journal)
> int first_tag = 0;
> int tag_flag;
> int i;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int write_op = WRITE;
>
> /*
> @@ -444,10 +444,10 @@ void journal_commit_transaction(journal_t *journal)
> * Now start flushing things to disk, in the order they appear
> * on the transaction lists. Data blocks go first.
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> err = journal_submit_data_buffers(journal, commit_transaction,
> write_op);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> /*
> * Wait for all previously submitted IO to complete.
> @@ -503,7 +503,7 @@ void journal_commit_transaction(journal_t *journal)
> err = 0;
> }
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> journal_write_revoke_records(journal, commit_transaction, write_op);
>
> @@ -697,7 +697,7 @@ start_journal_io:
> }
> }
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> /* Lo and behold: we have just managed to send a transaction to
> the log. Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..31b5e73 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -182,12 +182,12 @@ static void
> __flush_batch(journal_t *journal, int *batch_count)
> {
> int i;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (i = 0; i < *batch_count; i++)
> write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> for (i = 0; i < *batch_count; i++) {
> struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..713d26a 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -390,7 +390,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> int tag_bytes = journal_tag_bytes(journal);
> struct buffer_head *cbh = NULL; /* For transactional checksums */
> __u32 crc32_sum = ~0;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> /* Tail of the journal */
> unsigned long first_block;
> tid_t first_tid;
> @@ -555,7 +555,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> if (err)
> jbd2_journal_abort(journal, err);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> jbd2_journal_write_revoke_records(journal, commit_transaction,
> &log_bufs, WRITE_SYNC);
>
> @@ -805,7 +805,7 @@ start_journal_io:
> __jbd2_journal_abort_hard(journal);
> }
>
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> /* Lo and behold: we have just managed to send a transaction to
> the log. Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..926dc42 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -676,10 +676,10 @@ int
> mpage_writepages(struct address_space *mapping,
> struct writeback_control *wbc, get_block_t get_block)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int ret;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> if (!get_block)
> ret = generic_writepages(mapping, wbc);
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
> if (mpd.bio)
> mpage_bio_submit(WRITE, mpd.bio);
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> return ret;
> }
> EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..776ac5a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1208,7 +1208,7 @@ STATIC void
> _xfs_buf_ioapply(
> struct xfs_buf *bp)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int rw;
> int offset;
> int size;
> @@ -1281,7 +1281,7 @@ _xfs_buf_ioapply(
> */
> offset = bp->b_offset;
> size = BBTOB(bp->b_io_length);
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (i = 0; i < bp->b_map_count; i++) {
> xfs_buf_ioapply_map(bp, i, &offset, &size, rw);
> if (bp->b_error)
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
> if (size <= 0)
> break; /* all done */
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> /*
> @@ -1772,7 +1772,7 @@ __xfs_buf_delwri_submit(
> struct list_head *io_list,
> bool wait)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> struct xfs_buf *bp, *n;
> int pinned = 0;
>
> @@ -1806,7 +1806,7 @@ __xfs_buf_delwri_submit(
>
> list_sort(NULL, io_list, xfs_buf_cmp);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> list_for_each_entry_safe(bp, n, io_list, b_list) {
> bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
> bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>
> xfs_buf_submit(bp);
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> return pinned;
> }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..6074671 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -275,7 +275,7 @@ xfs_dir2_leaf_readbuf(
> struct xfs_inode *dp = args->dp;
> struct xfs_buf *bp = *bpp;
> struct xfs_bmbt_irec *map = mip->map;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int error = 0;
> int length;
> int i;
> @@ -404,7 +404,7 @@ xfs_dir2_leaf_readbuf(
> /*
> * Do we need more readahead?
> */
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (mip->ra_index = mip->ra_offset = i = 0;
> mip->ra_want > mip->ra_current && i < mip->map_blocks;
> i += geo->fsbcount) {
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
> }
> }
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> out:
> *bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..b890aa4 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -179,7 +179,7 @@ xfs_bulkstat_ichunk_ra(
> struct xfs_inobt_rec_incore *irec)
> {
> xfs_agblock_t agbno;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int blks_per_cluster;
> int inodes_per_cluster;
> int i; /* inode chunk index */
> @@ -188,7 +188,7 @@ xfs_bulkstat_ichunk_ra(
> blks_per_cluster = xfs_icluster_size_fsb(mp);
> inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (i = 0; i < XFS_INODES_PER_CHUNK;
> i += inodes_per_cluster, agbno += blks_per_cluster) {
> if (xfs_inobt_maskn(i, inodes_per_cluster) & ~irec->ir_free) {
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
> &xfs_inode_buf_ops);
> }
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> }
>
> /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..4617fce 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
> * schedule() where blk_schedule_flush_plug() is called.
> */
> struct blk_plug {
> + int depth; /* number of nested plugs */
> struct list_head list; /* requests */
> struct list_head mq_list; /* blk-mq requests */
> struct list_head cb_list; /* md requires an unplug callback */
> @@ -1106,7 +1107,7 @@ struct blk_plug_cb {
> };
> extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
> void *data, int size);
> -extern void blk_start_plug(struct blk_plug *);
> +extern struct blk_plug *blk_start_plug(struct blk_plug *);
> extern void blk_finish_plug(struct blk_plug *);
> extern void blk_flush_plug_list(struct blk_plug *, bool);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..1f7d5ad 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -463,7 +463,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> int error = -EINVAL;
> int write;
> size_t len;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> @@ -503,7 +503,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> if (vma && start > vma->vm_start)
> prev = vma;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (;;) {
> /* Still start < end. */
> error = -ENOMEM;
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> vma = find_vma(current->mm, start);
> }
> out:
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> if (write)
> up_write(&current->mm->mmap_sem);
> else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..9369d5e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2011,16 +2011,16 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
> int generic_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> int ret;
>
> /* deal with chardevs and other special file */
> if (!mapping->a_ops->writepage)
> return 0;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> return ret;
> }
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..c3350b5 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -111,11 +111,11 @@ EXPORT_SYMBOL(read_cache_pages);
> static int read_pages(struct address_space *mapping, struct file *filp,
> struct list_head *pages, unsigned nr_pages)
> {
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> unsigned page_idx;
> int ret;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
>
> if (mapping->a_ops->readpages) {
> ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
> ret = 0;
>
> out:
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> return ret;
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..3f6b8ec 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -455,7 +455,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = entry_offset;
> unsigned long start_offset, end_offset;
> unsigned long mask;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
>
> mask = swapin_nr_pages(offset) - 1;
> if (!mask)
> @@ -467,7 +467,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> SetPageReadahead(page);
> page_cache_release(page);
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
>
> lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..fd29974 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2130,7 +2130,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> enum lru_list lru;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> - struct blk_plug plug;
> + struct blk_plug *plug, onstack_plug;
> bool scan_adjusted;
>
> get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
> @@ -2152,7 +2152,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> sc->priority == DEF_PRIORITY);
>
> - blk_start_plug(&plug);
> + plug = blk_start_plug(&onstack_plug);
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> unsigned long nr_anon, nr_file, percentage;
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>
> scan_adjusted = true;
> }
> - blk_finish_plug(&plug);
> + blk_finish_plug(plug);
> sc->nr_reclaimed += nr_reclaimed;
>
> /*
> --
> 1.8.3.1
>



--
Ming Lei

2015-04-07 14:47:45

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

Ming Lei <[email protected]> writes:

> Hi Jeff,

Hi, Ming. Thanks for reviewing!

>> Unpatched kernel:
>> Read B/W: 283,638 KB/s
>> Read Merges: 0
>>
>> Patched kernel:
>> Read B/W: 873,224 KB/s
>> Read Merges: 2,046K
>
> The data is amazing, but it may be better to provide some latency
> data as well.

Yes, good point. I'll include the full fio output in the next posting.

>> Much of the patch involves modifying call sites to blk_start_plug,
>> since its signature is changed. The meat of the patch is actually
>
> I am wondering if the signature of blk_start_plug() really has to be
> changed, since the active plug is always the top-level plug with your
> patch, and blk_finish_plug() can find the active plug from current->plug.

That is a much simpler idea, thank you. I think blk_finish_plug
shouldn't take the plug argument, though, since it won't actually use
it.
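
Something like this, perhaps (untested sketch; it keeps the depth
field from the patch so only the outermost call flushes, via the
blk_flush_plug_list the patch exports):

        void blk_finish_plug(void)
        {
                struct blk_plug *plug = current->plug;

                if (plug && --plug->depth == 0) {
                        blk_flush_plug_list(plug, false);
                        current->plug = NULL;
                }
        }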

Cheers,
Jeff

2015-04-08 18:50:17

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

On Wed, Apr 08, 2015 at 11:38:14AM -0600, Jens Axboe wrote:
> On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> > [...]
>
> It's hard to argue with the increased merging for your case. The task plugs
> did originally work like you changed them to, not flushing until the
> outermost plug was flushed. Unfortunately I don't quite remember why I
> changed them, will have to do a bit of digging to refresh my memory.

The behavior never changed. The current code only flushes the outermost
plug: blk_start_plug() doesn't assign the plug to the current task for
an inner plug, so requests are all added to the outermost plug.

maybe the code can be cleaned up as:

void start_plug(struct blk_plug *plug)
{
        if (current->plug)
                return;
        current->plug = plug;
}

void end_plug(struct blk_plug *plug)
{
        if (plug != current->plug)
                return;
        flush_plug(plug);
        current->plug = NULL;
}
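
with that, nested sections simply collapse into the outermost plug.
A hypothetical call sequence:

        struct blk_plug outer, inner;

        start_plug(&outer);     /* installs current->plug */
        start_plug(&inner);     /* no-op: a plug is already active */
        submit_bio(WRITE, bio); /* queued on &outer via current->plug */
        end_plug(&inner);       /* no-op: &inner != current->plug */
        end_plug(&outer);       /* flushes everything */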

2015-04-08 17:38:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> [...]
>
> Comments would be greatly appreciated.

It's hard to argue with the increased merging for your case. The task
plugs did originally work like you changed them to, not flushing until
the outermost plug was flushed. Unfortunately I don't quite remember why
I changed them, will have to do a bit of digging to refresh my memory.

For cases where we don't do any merging (like nvme), we always want to
flush. Well, almost: if we start to utilize batched submission, then
the plug would still potentially help (just for reasons other than merging).

And I agree with Ming, this can be cleaned up substantially. I'd also
like to see some test results from the other end of the spectrum. Your
posted case is clearly the best case (we missed tons of merging, now we
don't); I'd like to see a normal case and a worst case result as well
so we have an idea of what this would do to latencies.

--
Jens Axboe

2015-04-08 17:56:55

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

Jens Axboe <[email protected]> writes:

>> Comments would be greatly appreciated.
>
> It's hard to argue with the increased merging for your case. The task
> plugs did originally work like you changed them to, not flushing until
> the outermost plug was flushed. Unfortunately I don't quite remember
> why I changed them, will have to do a bit of digging to refresh my
> memory.

Let me know what you dig up.

> For cases where we don't do any merging (like nvme), we always want to
> flush. Well, almost: if we start to utilize batched submission,
> then the plug would still potentially help (just for reasons other
> than merging).

It's never straight-forward. :)

> And I agree with Ming, this can be cleaned up substantially. I'd also

Let me know if you have any issues with the v2 posting.

> like to see some test results from the other end of the spectrum. Your
> posted case is clearly the best case (we missed tons of merging, now we
> don't); I'd like to see a normal case and a worst case result as well
> so we have an idea of what this would do to latencies.

As for other benchmark numbers, this work was inspired by two reports of
lower performance after converting virtio_blk to use blk-mq in RHEL.
The workloads were both synthetic, though. I'd be happy to run a
battery of tests on the patch set, but if there is anything specific you
want to see (besides a workload that will have no merges), let me know.

I guess I should also mention that another solution to the problem might
be the mythical mq I/O scheduler. :)

Cheers,
Jeff

2015-04-16 15:47:17

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

Jens Axboe <[email protected]> writes:

> And agree with Ming, this can be cleaned up substantially. I'd also
> like to see some test results from the other end of the spectrum. Your
> posted cased is clearly based case (we missed tons of merging, now we
> don't), I'd like to see a normal case and a worst case result as well
> so we have an idea of what this would do to latencies.

I re-ran my tests (just sequential reads) to verify where the
performance benefit comes from. Here are the results:

(Read B/W in KB/s; percentages are relative to vanilla)

device | vanilla | patch1    | dio-noplug | noflush-nested
-------+---------+-----------+------------+---------------
rssda  | 701,684 | 1,168,527 | 1,342,177  | 1,297,612
       | 100%    | +66%      | +91%       | +85%
vdb0   | 358,264 | 902,913   | 906,850    | 922,327
       | 100%    | +152%     | +153%      | +157%

Patch1 refers to the first patch in this series, which fixes the merge
logic for single-queue blk-mq devices. Each column after that includes
that first patch. In dio-noplug, I removed the blk_plug from the
direct-io code path (so there is no nesting at all). This is a control,
since it is what I expect the outcome of the noflush-nested column to
actually be. Then, the noflush-nested column leaves the blk_plug in
place in the dio code, but includes the patch that prevents nested
blk_plug's from being flushed. All numbers are the average of 5 runs.
With the exception of the vanilla run on rssda (the first run was
faster, causing the average to go up), the standard deviation is very
small.

As you can see, for the micron device (rssda) we get quite a bit of an
improvement from just fixing the plugging, but we could do quite a bit
better. For the virtio-blk device, the follow-on patches don't
contribute very much to the overall performance.

So, I'm happy to push in just patch 1 at this time. It is a bugfix,
after all, and could even be marked for stable. Does that sound
reasonable?

Later, if time permits, I can look into refining patch 2, though I don't
have any firm plans to do that right now.

Thanks,
Jeff