2011-05-05 02:04:22

by Shaohua Li

Subject: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

In some drives, flush requests are non-queueable. When a flush request is
running, normal read/write requests can't run. If the block layer dispatches
such a request, the driver can't handle it and has to requeue it.
Tejun suggested we hold the queue while a flush is running. This avoids the
unnecessary requeue.
It can also improve performance. For example, say we have requests flush1,
write1, flush2. flush1 is dispatched, then the queue is held, so write1 isn't
inserted into the queue. After flush1 finishes, flush2 is dispatched. Since
the disk cache is already clean, flush2 finishes very quickly, so it looks
like flush2 is folded into flush1.
In my test, holding the queue completely resolves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:
block: make the flush insertion use the tail of the dispatch list

It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
which causes about a 20% regression when running a sysbench fileio
workload.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-flush.c | 16 +++++++++++-----
block/blk.h | 21 ++++++++++++++++++++-
include/linux/blkdev.h | 1 +
3 files changed, 32 insertions(+), 6 deletions(-)

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c 2011-05-05 08:36:51.000000000 +0800
+++ linux/block/blk-flush.c 2011-05-05 09:28:37.000000000 +0800
@@ -212,13 +212,19 @@ static void flush_end_io(struct request
}

/*
- * Moving a request silently to empty queue_head may stall the
- * queue. Kick the queue in those cases. This function is called
- * from request completion path and calling directly into
- * request_fn may confuse the driver. Always use kblockd.
+ * Kick the queue to avoid stall for two cases:
+ * 1. Moving a request silently to empty queue_head may stall the
+ * queue.
+ * 2. When flush request is running in non-queueable queue, the
+ * queue is hold. Restart the queue after flush request is finished
+ * to avoid stall.
+ * This function is called from request completion path and calling
+ * directly into request_fn may confuse the driver. Always use
+ * kblockd.
*/
- if (queued)
+ if (queued || q->flush_queue_delayed)
blk_run_queue_async(q);
+ q->flush_queue_delayed = 0;
}

/**
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h 2011-05-05 09:05:18.000000000 +0800
+++ linux/include/linux/blkdev.h 2011-05-05 09:06:08.000000000 +0800
@@ -365,6 +365,7 @@ struct request_queue
*/
unsigned int flush_flags;
unsigned int flush_not_queueable:1;
+ unsigned int flush_queue_delayed:1;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
unsigned long flush_pending_since;
Index: linux/block/blk.h
===================================================================
--- linux.orig/block/blk.h 2011-05-05 08:36:51.000000000 +0800
+++ linux/block/blk.h 2011-05-05 09:49:34.000000000 +0800
@@ -61,7 +61,26 @@ static inline struct request *__elv_next
rq = list_entry_rq(q->queue_head.next);
return rq;
}
-
+ /*
+ * Flush request is running and flush request isn't queueable
+ * in the drive, we can hold the queue till flush request is
+ * finished. Even we don't do this, driver can't dispatch next
+ * requests and will requeue them. And this can improve
+ * throughput too. For example, we have request flush1, write1,
+ * flush 2. flush1 is dispatched, then queue is hold, write1
+ * isn't inserted to queue. After flush1 is finished, flush2
+ * will be dispatched. Since disk cache is already clean,
+ * flush2 will be finished very soon, so looks like flush2 is
+ * folded to flush1.
+ * Since the queue is hold, a flag is set to indicate the queue
+ * should be restarted later. Please see flush_end_io() for
+ * details.
+ */
+ if (q->flush_pending_idx != q->flush_running_idx &&
+ !queue_flush_queueable(q)) {
+ q->flush_queue_delayed = 1;
+ return NULL;
+ }
if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}


2011-05-05 08:39:04

by Tejun Heo

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

(cc'ing James, Ric, Christoph and lscsi. Hi! Please jump to the
bottom of the message.)

Hello,

On Thu, May 05, 2011 at 09:59:34AM +0800, [email protected] wrote:
> In some drives, flush requests are non-queueable. When flush request is running,
> normal read/write requests can't run. If block layer dispatches such request,
> driver can't handle it and requeue it.
> Tejun suggested we can hold the queue when flush is running. This can avoid
> unnecessary requeue.
> Also this can improve performance. For example, we have request flush1, write1,
> flush 2. flush1 is dispatched, then queue is hold, write1 isn't inserted to
> queue. After flush1 is finished, flush2 will be dispatched. Since disk cache
> is already clean, flush2 will be finished very soon, so looks like flush2 is
> folded to flush1.
> In my test, the queue holding completely solves a regression introduced by
> commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:
> block: make the flush insertion use the tail of the dispatch list
>
> It's not a preempt type request, in fact we have to insert it
> behind requests that do specify INSERT_FRONT.
> which causes about 20% regression running a sysbench fileio
> workload.
>
> Signed-off-by: Shaohua Li <[email protected]>

Acked-by: Tejun Heo <[email protected]>

But, I think the description and comments can use some refinements.
First of all, new lines won't steal your firstborn. Don't be too
afraid of them, both in the patch description and in comments.

For the patch description, I want to recommend explaining the
regression case first - explain why the regression happened and then
show how this patch solves the issue. Also, the more conventional way
to refer to a commit is 53d63e6b0d (block: make the flush insertion use
the tail of the dispatch list).

> /*
> - * Moving a request silently to empty queue_head may stall the
> - * queue. Kick the queue in those cases. This function is called
> - * from request completion path and calling directly into
> - * request_fn may confuse the driver. Always use kblockd.
> + * Kick the queue to avoid stall for two cases:
> + * 1. Moving a request silently to empty queue_head may stall the
> + * queue.
> + * 2. When flush request is running in non-queueable queue, the
> + * queue is hold. Restart the queue after flush request is finished
> + * to avoid stall.
> + * This function is called from request completion path and calling
> + * directly into request_fn may confuse the driver. Always use
> + * kblockd.
> */

Yeap, pretty good, but let's add a bit more whitespace and apply
slight adjustments.

/*
* After flush sequencing, the following two cases may lead to
* queue stall.
*
* 1. Moving a request silently to empty queue_head.
*
* 2. If flush request was non-queueable, request dispatching may
* have been blocked while flush was in progress.
*
* Make sure queue processing is restarted by kicking the queue.
* As this function is called from request completion path,
* calling directly into request_fn may confuse the driver. Always
* use kblockd.
*/

> + /*
> + * Flush request is running and flush request isn't queueable
> + * in the drive, we can hold the queue till flush request is
> + * finished. Even we don't do this, driver can't dispatch next
> + * requests and will requeue them. And this can improve
> + * throughput too. For example, we have request flush1, write1,
> + * flush 2. flush1 is dispatched, then queue is hold, write1
> + * isn't inserted to queue. After flush1 is finished, flush2
> + * will be dispatched. Since disk cache is already clean,
> + * flush2 will be finished very soon, so looks like flush2 is
> + * folded to flush1.
> + * Since the queue is hold, a flag is set to indicate the queue
> + * should be restarted later. Please see flush_end_io() for
> + * details.
> + */

Similarly, I'd like to suggest something like the following.

/*
* Hold dispatching of regular requests if non-queueable
* flush is in progress; otherwise, the low level driver
* would keep dispatching IO requests just to requeue them
* until the flush finishes, which not only adds
* dispatching / requeueing overhead but may also
* significantly affect throughput when multiple flushes
* are issued back-to-back. Please consider the following
* scenario.
*
* - flush1 is dispatched with write1 in the elevator.
*
* - Driver dispatches write1 and requeues it.
*
* - flush2 is issued and appended to dispatch queue after
* the requeued write1. As write1 has been requeued
* flush2 can't be put in front of it.
*
* - When flush1 finishes, the driver has to process write1
* before flush2 even though there's no fundamental
* reason flush2 can't be processed first and, when two
* flushes are issued back-to-back without intervening
* writes, the second one essentially becomes noop.
*
* This phenomena becomes quite visible under heavy
* concurrent fsync workload and holding the queue while
* flush is in progress leads to significant throughput
* gain.
*/
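
To make the scenario above concrete, here is a throwaway userspace toy
model (plain C, not kernel code; the "queue", "cache" and cost counting
are invented purely for illustration). It replays the two dispatch
orders and shows why flush2 becomes a no-op when the queue is held:

#include <stdio.h>
#include <stdbool.h>

enum req { WRITE1, FLUSH2 };

/* Replay the requests that follow flush1 in the given order against a
 * write-back cache that flush1 has just cleaned, and count how many
 * flushes actually have to touch the media. */
static void replay(const char *name, const enum req *order, int n)
{
        bool cache_dirty = false;       /* flush1 just completed */
        int media_flushes = 1;          /* flush1 itself */
        int i;

        printf("%s: flush1 (flushes cache)", name);
        for (i = 0; i < n; i++) {
                if (order[i] == WRITE1) {
                        printf(" -> write1 (dirties cache)");
                        cache_dirty = true;
                } else if (cache_dirty) {
                        printf(" -> flush2 (flushes cache)");
                        media_flushes++;
                        cache_dirty = false;
                } else {
                        printf(" -> flush2 (no-op, cache clean)");
                }
        }
        printf("\n  flushes that hit the media: %d\n", media_flushes);
}

int main(void)
{
        /* Without holding the queue: write1 was dispatched and requeued
         * ahead of flush2, so the driver must process it first. */
        const enum req no_hold[] = { WRITE1, FLUSH2 };
        /* With the queue held: write1 stays in the elevator until the
         * flush finishes, so flush2 runs right after flush1. */
        const enum req hold[] = { FLUSH2, WRITE1 };

        replay("no hold", no_hold, 2);
        replay("hold   ", hold, 2);
        return 0;
}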

And two more things that I think are worth investigating.

- I wonder whether this would be useful even for devices which can
queue flushes (ie. native SCSI ones). There definitely are some
benefits to queueing flushes in terms of hiding command dispatching
overhead and, if the device is smart/deep enough, processing
non-conflicting operations in parallel (ie. reads and flushing later
writes together if the head sweeps that way).

That said, flushes are mostly mutually exclusive w.r.t. writes and,
even with queueable flushes, we might benefit more from holding
further writes until the flush finishes. Under a light sync workload,
this doesn't matter anyway. Under a heavy one, the benefit of queueing
the later writes together can easily be outweighed by some of the
flushes becoming noops.

Unfortunately (or rather, fortunately), I don't have any access to
such fancy devices, so it would be great if the upscale storage guys
could tinker with it a bit and see how it fares. If it goes well, we
can also make things more advanced by implementing back-to-back
noop'ing in the block layer and allowing reads to be issued in parallel
with flushes, if the benefits they bring justify the added complexity.

- This is much more minor, but if the block layer already knows flushes
are non-queueable, it might be a good idea to hold dispatching of
flushes if other requests are already in progress. It would only
save dispatch/requeue overhead, which might not matter at all, so
this has a pretty good chance of not being worth the added complexity
tho.

So, is anyone from the upscale storage world interested?

Thanks.

--
tejun

2011-05-06 04:32:10

by Shaohua Li

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

On Thu, 2011-05-05 at 16:38 +0800, Tejun Heo wrote:
> (cc'ing James, Ric, Christoph and lscsi. Hi! Please jump to the
> bottom of the message.)
>
> Hello,
>
> On Thu, May 05, 2011 at 09:59:34AM +0800, [email protected] wrote:
> > In some drives, flush requests are non-queueable. When flush request is running,
> > normal read/write requests can't run. If block layer dispatches such request,
> > driver can't handle it and requeue it.
> > Tejun suggested we can hold the queue when flush is running. This can avoid
> > unnecessary requeue.
> > Also this can improve performance. For example, we have request flush1, write1,
> > flush 2. flush1 is dispatched, then queue is hold, write1 isn't inserted to
> > queue. After flush1 is finished, flush2 will be dispatched. Since disk cache
> > is already clean, flush2 will be finished very soon, so looks like flush2 is
> > folded to flush1.
> > In my test, the queue holding completely solves a regression introduced by
> > commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:
> > block: make the flush insertion use the tail of the dispatch list
> >
> > It's not a preempt type request, in fact we have to insert it
> > behind requests that do specify INSERT_FRONT.
> > which causes about 20% regression running a sysbench fileio
> > workload.
> >
> > Signed-off-by: Shaohua Li <[email protected]>
>
> Acked-by: Tejun Heo <[email protected]>
Thanks. I updated changelogs.

> And two more things that I think are worth investigating.
>
> - I wonder whether this would be useful for even devices which can
> queue flushes (ie. native SCSI ones). There definitely are some
> benefits to queueing flushes in terms of hiding command dispatching
> overhead and if the device is smart/deep enough parallelly
> processing non-conflicting operations (ie. reads and flushing later
> writes together if the head sweeps that way).
>
> That said, flushes are mostly mutually exclusive w.r.t. writes and
> even with queueable flushes, we might benefit more by holding
> further writes until flush finishes. Under light sync workload,
> this doesn't matter anyway. Under heavy, the benefit of queueing
> the later writes together can be easily outweighted by some of
> flushes becoming noops.
>
> Unfortunately (or rather, fortunately), I don't have any access to
> such fancy devices so it would be great if the upscale storage guys
> can tinker with it a bit and see how it fares. If it goes well, we
> can also make things more advanced by implementing back-to-back
> noop'ing in block layer and allowing issue of reads parallelly with
> flushes, if the benefits they bring justify the added complexity.
>
> - This is much more minor but if block layer already knows flushes are
> non-queueable, it might be a good idea to hold dispatching of
> flushes if other requests are already in progress. It will only
> save dispatch/requeue overhead which might not matter at all, so
> this has pretty good chance of not being worth of the added
> complexity tho.
I did some experiments with holding flushes too, but saw no obvious
performance difference. It doesn't make more flush requests merge. Avoiding
the unnecessary requeue should be a gain for fast devices, but my test
doesn't show it.


Subject: block: hold queue if flush is running for non-queueable flush drive

Commit 53d63e6b0dfb9 (block: make the flush insertion use the tail of
the dispatch list) causes about a 20% regression when running a sysbench
fileio workload. Let's consider the following scenario:
- flush1 is dispatched with write1 in the elevator.
- Driver dispatches write1 and requeues it.
- flush2 is issued and appended to dispatch queue after the requeued write1.
As write1 has been requeued flush2 can't be put in front of it.
- When flush1 finishes, the driver has to process write1 before flush2 even
though there's no fundamental reason flush2 can't be processed first and,
when two flushes are issued back-to-back without intervening writes, the
second one essentially becomes noop.
Without the commit, flush2 is inserted before write1, so the issue is hidden.
But the commit itself makes sense: since a flush request isn't a preempt
request, there is no reason to add it to the queue head.

The regression is exposed on a SATA device. In SATA, flush requests are
non-queueable. When a flush request is running, normal read/write requests
can't run. If the block layer dispatches such a request, the driver can't
handle it and has to requeue it. Tejun suggested we hold the queue while a
flush is running. This avoids the unnecessary requeue.

This also improves performance and resolves the regression. In the above
scenario, while flush1 is running the queue is held, so write1 isn't
dispatched and flush2 is the only request in the queue. After flush1
finishes, flush2 is dispatched soon after. Since there is no write between
flush1 and flush2, flush2 essentially becomes a noop.

Signed-off-by: Shaohua Li <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/blk-flush.c | 19 ++++++++++++++-----
block/blk.h | 35 ++++++++++++++++++++++++++++++++++-
include/linux/blkdev.h | 1 +
3 files changed, 49 insertions(+), 6 deletions(-)

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c 2011-05-05 10:33:03.000000000 +0800
+++ linux/block/blk-flush.c 2011-05-06 11:21:20.000000000 +0800
@@ -212,13 +212,22 @@ static void flush_end_io(struct request
}

/*
- * Moving a request silently to empty queue_head may stall the
- * queue. Kick the queue in those cases. This function is called
- * from request completion path and calling directly into
- * request_fn may confuse the driver. Always use kblockd.
+ * After flush sequencing, the following two cases may lead to
+ * queue stall.
+ *
+ * 1. Moving a request silently to empty queue_head.
+ *
+ * 2. If flush request was non-queueable, request dispatching may
+ * have been blocked while flush was in progress.
+ *
+ * Make sure queue processing is restarted by kicking the queue.
+ * As this function is called from request completion path,
+ * calling directly into request_fn may confuse the driver. Always
+ * use kblockd.
*/
- if (queued)
+ if (queued || q->flush_queue_delayed)
blk_run_queue_async(q);
+ q->flush_queue_delayed = 0;
}

/**
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h 2011-05-06 11:20:08.000000000 +0800
+++ linux/include/linux/blkdev.h 2011-05-06 11:20:14.000000000 +0800
@@ -365,6 +365,7 @@ struct request_queue
*/
unsigned int flush_flags;
unsigned int flush_not_queueable:1;
+ unsigned int flush_queue_delayed:1;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
unsigned long flush_pending_since;
Index: linux/block/blk.h
===================================================================
--- linux.orig/block/blk.h 2011-05-05 10:33:03.000000000 +0800
+++ linux/block/blk.h 2011-05-06 11:22:42.000000000 +0800
@@ -61,7 +61,40 @@ static inline struct request *__elv_next
rq = list_entry_rq(q->queue_head.next);
return rq;
}
-
+ /*
+ * Hold dispatching of regular requests if non-queueable
+ * flush is in progress; otherwise, the low level driver
+ * would keep dispatching IO requests just to requeue them
+ * until the flush finishes, which not only adds
+ * dispatching / requeueing overhead but may also
+ * significantly affect throughput when multiple flushes
+ * are issued back-to-back. Please consider the following
+ * scenario.
+ *
+ * - flush1 is dispatched with write1 in the elevator.
+ *
+ * - Driver dispatches write1 and requeues it.
+ *
+ * - flush2 is issued and appended to dispatch queue after
+ * the requeued write1. As write1 has been requeued
+ * flush2 can't be put in front of it.
+ *
+ * - When flush1 finishes, the driver has to process write1
+ * before flush2 even though there's no fundamental
+ * reason flush2 can't be processed first and, when two
+ * flushes are issued back-to-back without intervening
+ * writes, the second one essentially becomes noop.
+ *
+ * This phenomena becomes quite visible under heavy
+ * concurrent fsync workload and holding the queue while
+ * flush is in progress leads to significant throughput
+ * gain.
+ */
+ if (q->flush_pending_idx != q->flush_running_idx &&
+ !queue_flush_queueable(q)) {
+ q->flush_queue_delayed = 1;
+ return NULL;
+ }
if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}

2011-05-06 06:53:54

by Tejun Heo

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

Hello,

On Fri, May 06, 2011 at 12:32:05PM +0800, Shaohua Li wrote:
> > - This is much more minor but if block layer already knows flushes are
> > non-queueable, it might be a good idea to hold dispatching of
> > flushes if other requests are already in progress. It will only
> > save dispatch/requeue overhead which might not matter at all, so
> > this has pretty good chance of not being worth of the added
> > complexity tho.
>
> I did some experiment to hold flush too, but no obvious performance
> difference. It doesn't make more flush requests merge. Avoiding
> unnecessary requeue is a gain for fast devices, but my test doesn't
> show.

I see. Thanks for testing.

> Subject: block: hold queue if flush is running for non-queueable flush drive
>
> Commit 53d63e6b0dfb9(block: make the flush insertion use the tail of
> the dispatch list) causes about 20% regression running a sysbench fileio
> workload. Let's consider the following scenario:
> - flush1 is dispatched with write1 in the elevator.
> - Driver dispatches write1 and requeues it.
> - flush2 is issued and appended to dispatch queue after the requeued write1.
> As write1 has been requeued flush2 can't be put in front of it.
> - When flush1 finishes, the driver has to process write1 before flush2 even
> though there's no fundamental reason flush2 can't be processed first and,
> when two flushes are issued back-to-back without intervening writes, the
> second one essentially becomes noop.
> Without the commit, flush2 is inserted before write1, so the issue is hiden.
> But the commit itself makes sense, because flush request isn't a preempt
> request, there is no reason to add it to queue head.
>
> The regression is exposed in a SATA device. In SATA, flush requests are
> non-queueable. When flush request is running, normal read/write requests
> can't run. If block layer dispatches such request, driver can't handle it
> and requeue it. Tejun suggested we can hold the queue when flush is running.
> This can avoid unnecessary requeue.
>
> And also this can improve performance and solve the regression. In above
> scenario, when flush1 is running, queue is hold, so write1 isn't dispatched.
> flush2 will be the only request in the queue. After flush1 is finished, flush2
> will be dispatched soon. Since there is no write between flush1 and flush2,
> flush2 essentially becomes noop.
>
> Signed-off-by: Shaohua Li <[email protected]>
> Acked-by: Tejun Heo <[email protected]>

Jens, can you please queue this for the next merge window?

Thanks.

--
tejun

2011-05-06 17:28:59

by Jens Axboe

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

On 2011-05-06 00:53, Tejun Heo wrote:
> Jens, can you please queue this for the next merge window?

Yep I will, I'll mark it as a backport as well.

--
Jens Axboe

2011-05-09 13:03:47

by Vivek Goyal

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

On Thu, May 05, 2011 at 10:38:53AM +0200, Tejun Heo wrote:

[..]
> Similarly, I'd like to suggest something like the following.
>
> /*
> * Hold dispatching of regular requests if non-queueable
> * flush is in progress; otherwise, the low level driver
> * would keep dispatching IO requests just to requeue them
> * until the flush finishes, which not only adds
> * dispatching / requeueing overhead but may also
> * significantly affect throughput when multiple flushes
> * are issued back-to-back. Please consider the following
> * scenario.
> *
> * - flush1 is dispatched with write1 in the elevator.
> *
> * - Driver dispatches write1 and requeues it.
> *
> * - flush2 is issued and appended to dispatch queue after
> * the requeued write1. As write1 has been requeued
> * flush2 can't be put in front of it.
> *
> * - When flush1 finishes, the driver has to process write1
> * before flush2 even though there's no fundamental
> * reason flush2 can't be processed first and, when two
> * flushes are issued back-to-back without intervening
> * writes, the second one essentially becomes noop.
> *
> * This phenomena becomes quite visible under heavy
> * concurrent fsync workload and holding the queue while
> * flush is in progress leads to significant throughput
> * gain.
> */

Tejun,

I am assuming that these back-to-back flushes are independent of each
other; otherwise a write request will get between the two flushes anyway.

If that's the case, then should we solve the problem by improving the
flush merge logic a bit instead? (Say, idle a bit before issuing a flush,
but only if the request queue is not empty.)

That way multiple back-to-back flushes can be merged without taking a hit on
throughput, and we can avoid special-casing whether the driver can queue the
flush or not.
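
A rough sketch of what that decision could look like (purely illustrative
userspace C, not the actual flush machinery; the helper name and the
tunable below are made up):

#include <stdio.h>
#include <stdbool.h>

#define FLUSH_MERGE_WINDOW_MS 2         /* made-up tunable */

struct flush_state {
        int pending_flushes;            /* flushes waiting to be issued */
        int queued_requests;            /* other requests in the queue */
};

/* Issue a pending flush immediately when the queue is otherwise empty;
 * if other requests are queued, idle briefly in the hope that more
 * flushes arrive and can be folded into a single one. */
static bool issue_flush_now(const struct flush_state *s)
{
        if (!s->pending_flushes)
                return false;           /* nothing to issue */
        if (!s->queued_requests)
                return true;            /* nothing to gain by waiting */
        return false;                   /* wait up to FLUSH_MERGE_WINDOW_MS */
}

int main(void)
{
        const struct flush_state busy = { 1, 3 };
        const struct flush_state idle = { 1, 0 };

        printf("busy queue:  %s\n",
               issue_flush_now(&busy) ? "issue now" : "idle a bit first");
        printf("empty queue: %s\n",
               issue_flush_now(&idle) ? "issue now" : "idle a bit first");
        return 0;
}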

Thanks
Vivek

2011-05-09 13:50:52

by Shaohua Li

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

On Mon, May 09, 2011 at 09:03:16PM +0800, Vivek Goyal wrote:
> On Thu, May 05, 2011 at 10:38:53AM +0200, Tejun Heo wrote:
>
> [..]
> > Similarly, I'd like to suggest something like the following.
> >
> > /*
> > * Hold dispatching of regular requests if non-queueable
> > * flush is in progress; otherwise, the low level driver
> > * would keep dispatching IO requests just to requeue them
> > * until the flush finishes, which not only adds
> > * dispatching / requeueing overhead but may also
> > * significantly affect throughput when multiple flushes
> > * are issued back-to-back. Please consider the following
> > * scenario.
> > *
> > * - flush1 is dispatched with write1 in the elevator.
> > *
> > * - Driver dispatches write1 and requeues it.
> > *
> > * - flush2 is issued and appended to dispatch queue after
> > * the requeued write1. As write1 has been requeued
> > * flush2 can't be put in front of it.
> > *
> > * - When flush1 finishes, the driver has to process write1
> > * before flush2 even though there's no fundamental
> > * reason flush2 can't be processed first and, when two
> > * flushes are issued back-to-back without intervening
> > * writes, the second one essentially becomes noop.
> > *
> > * This phenomena becomes quite visible under heavy
> > * concurrent fsync workload and holding the queue while
> > * flush is in progress leads to significant throughput
> > * gain.
> > */
>
> Tejun,
>
> I am assuming that these back-to-back flushes are independent of each
> other otherwise write request will anyway get between two flushes.
Hi,
yes, the flushes are independent.

> If that's the case, then should we solve the problem by improving flush
> merge logic a bit better. (Say idle a bit before issuing a flush only
> if request queue is not empty).
I tried some ways to improve the flush merge logic. The problem I observed
is something like this: say we have 10 flushes; originally we dispatch 4
flushes, a write, then 6 flushes. With more merging we get 6 flushes, a
write, then 4 flushes, so the number of flush requests sent to the drive
isn't reduced.
Another reason I didn't see an improvement from better back-to-back merging
might be that the drive already optimizes the two-adjacent-flushes case well.
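
One way to read the example above (assuming each contiguous run of flushes
can be folded into a single flush command at the drive) is that regrouping
doesn't change how many flush commands the drive actually sees. A throwaway
illustration in plain C:

#include <stdio.h>

/* Count the flush commands the drive would see if every contiguous run
 * of flushes ('F') collapses into one; 'W' is a write. */
static int drive_flushes(const char *seq)
{
        int i, n = 0;

        for (i = 0; seq[i]; i++)
                if (seq[i] == 'F' && (i == 0 || seq[i - 1] != 'F'))
                        n++;
        return n;
}

int main(void)
{
        /* 4 flushes, write, 6 flushes vs. 6 flushes, write, 4 flushes */
        printf("FFFFWFFFFFF -> %d flush commands\n", drive_flushes("FFFFWFFFFFF"));
        printf("FFFFFFWFFFF -> %d flush commands\n", drive_flushes("FFFFFFWFFFF"));
        return 0;
}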

Thanks,
Shaohua

2011-05-09 13:59:28

by Vivek Goyal

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

On Mon, May 09, 2011 at 09:50:46PM +0800, Shaohua Li wrote:
> On Mon, May 09, 2011 at 09:03:16PM +0800, Vivek Goyal wrote:
> > On Thu, May 05, 2011 at 10:38:53AM +0200, Tejun Heo wrote:
> >
> > [..]
> > > Similarly, I'd like to suggest something like the following.
> > >
> > > /*
> > > * Hold dispatching of regular requests if non-queueable
> > > * flush is in progress; otherwise, the low level driver
> > > * would keep dispatching IO requests just to requeue them
> > > * until the flush finishes, which not only adds
> > > * dispatching / requeueing overhead but may also
> > > * significantly affect throughput when multiple flushes
> > > * are issued back-to-back. Please consider the following
> > > * scenario.
> > > *
> > > * - flush1 is dispatched with write1 in the elevator.
> > > *
> > > * - Driver dispatches write1 and requeues it.
> > > *
> > > * - flush2 is issued and appended to dispatch queue after
> > > * the requeued write1. As write1 has been requeued
> > > * flush2 can't be put in front of it.
> > > *
> > > * - When flush1 finishes, the driver has to process write1
> > > * before flush2 even though there's no fundamental
> > > * reason flush2 can't be processed first and, when two
> > > * flushes are issued back-to-back without intervening
> > > * writes, the second one essentially becomes noop.
> > > *
> > > * This phenomena becomes quite visible under heavy
> > > * concurrent fsync workload and holding the queue while
> > > * flush is in progress leads to significant throughput
> > > * gain.
> > > */
> >
> > Tejun,
> >
> > I am assuming that these back-to-back flushes are independent of each
> > other otherwise write request will anyway get between two flushes.
> Hi,
> yes, the flushes are independent.
>
> > If that's the case, then should we solve the problem by improving flush
> > merge logic a bit better. (Say idle a bit before issuing a flush only
> > if request queue is not empty).
> I tried some ways to improve flush merge logic. The problem I observed is something like:
> say we have 10 flushes, originally we dispatch 4 flush, write, 6 flush. doing more merge
> we have 6 flush, write, 4 flush. the flush request number sent to drive isn't reduced.

If we try to get rid of the WRITE completely between these 10 flushes, then
we run the risk of starving other READs/WRITEs as long as flushes are going on.

> Another reason I didn't see improvement with better back-to-back merge might be drive
> already optimizes two adjacent flushes case well.

I did not understand this one. With the improved back-to-back merge logic
we have already optimized the flush case. So for 10 flushes and one write
we seem to be issuing the following (as per your mail):

1 flush (6 flushes merged) --> WRITE --> 1 flush (4 flushes merged)

So where is the opportunity for the drive (a non-flush-queueing drive) to
optimize flushes here?

IOW, if flush merging is already working well, do we really want to move
in a direction where we can potentially starve other READs/WRITEs in an
attempt to improve throughput for a specific workload?

Thanks
Vivek

2011-05-09 14:37:11

by Tejun Heo

Subject: Re: [patch v3 2/3] block: hold queue if flush is running for non-queueable flush drive

Hello, Vivek, Shaohua.

On Mon, May 09, 2011 at 09:58:51AM -0400, Vivek Goyal wrote:
> > > I am assuming that these back-to-back flushes are independent of each
> > > other otherwise write request will anyway get between two flushes.

Yeap.

> > > If that's the case, then should we solve the problem by improving flush
> > > merge logic a bit better. (Say idle a bit before issuing a flush only
> > > if request queue is not empty).

Maybe, I don't know. Maybe we can implement some sort of adaptive delay
logic which detects parallel streams of flush requests and tries to insert
artificial delays between flush issues to maximize throughput. With such a
workload, that optimization probably wouldn't hurt latency much either.

However, such heuristics tend to be fairly finicky and result in
unexpected behavior when the workload isn't of the expected pattern, which
is why I stuck with the mostly mechanical mechanism we currently have for
the initial implementation, but please feel free to play with it. It would
be awesome if such logic could be implemented with something mechanical
which doesn't involve a lot of magic tunables, timers and pattern detection
logic.

> If we try to get rid of WRITE completely between these 10 flushes
> then we run the risk of starving other READS/WRITES as long as
> flushes are going on.

Back-to-back flushes are very cheap. They just cost command
issue/completion roundtrip latency, and if that's a problem we can
optimize that out too, so I don't think a stream of back-to-back empty
flushes can be a problem. For flushes with data, nothing prevents
other IOs from being issued while flush writes are being sequenced. It
will look like:

FLUSH -> FLUSH and other data -> FLUSH -> FLUSH -> FLUSH and other data...

Before this patch, it was

FLUSH -> FLUSH and other data -> FLUSH -> other data -> FLUSH -> FLUSH and other data...

The change isn't gonna cause starvation.

> I did not understand this one. With improved back to back merge logic
> we have already optimized the flush case. So for 10 flush and one write
> we seem to be issuing following (as per your mail).
>
> 1 flush (6 flush merged)--> WRITE --> 1flush (4 flush merged).
>
> So where is the opprotinutiy for drive (non flush queuing drive) to optimize
> flush here?

I suspect the workload doesn't really swamp the device with fsyncs and
there frequently are brief holes without any pending flush. In such cases,
the flush sequencer starts sequencing immediately, the following flushes
lose the chance to be merged, and the intervening write causes the
performance hit. It would be interesting to investigate this deeper tho.

> IOW, if flush merging is already working well, do we really want to move
> in a direction where we can potentially starve other READ/WRITE happening
> in an attempt to improve throughput for a sepecific workload.

I don't think it's gonna starve anything, and I actually like it as it
effectively implements another side of mechanical merging. Before the
patch, we only tried to merge requests which already shared a FLUSH,
i.e. we delayed the post-FLUSH until all writes which shared the pre-FLUSH
finished. This change doesn't introduce any latency on an idle queue while
allowing effective merging of flushes which immediately follow the first
one, and there are no knobs to adjust - timings are regulated by how fast
the device can handle flushes and IOs.

So, whether we add further merge strategies or not, I think this one can
serve nicely as one of the base logics, and I would really like to see how
it affects higher-end devices.

Please note that really high-end devices with battery-backed buffers
wouldn't care about any of this anyway. I think we would mostly be looking
at SCSI hard drives or cheap arrays, where I don't think issuing other IOs
together with a flush would bring a lot of benefit. So, I'd like to enable
the logic by default unless it actively hurts those devices.

Thank you.

--
tejun