2011-04-19 08:44:52

by Shaohua Li

Subject: [PATCH 1/2]block: optimize non-queueable flush request drive


In some drives, flush requests are non-queueable: while a flush request is
running, normal read/write requests are not. On such drives, when the running
flush request finishes, we can complete the pending flush requests as well.
Since no normal write requests ran in the meantime, the running flush has
already flushed out the drive cache that the pending flushes require. This
eliminates some flush executions and improves performance.

This patch lets the block core use the optimization. The next patch will
enable it for SATA.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-flush.c | 15 +++++++++++++--
include/linux/blkdev.h | 11 +++++++++++
2 files changed, 24 insertions(+), 2 deletions(-)

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c 2011-04-19 09:21:47.000000000 +0800
+++ linux/block/blk-flush.c 2011-04-19 16:38:22.000000000 +0800
@@ -193,18 +193,29 @@ static bool blk_flush_complete_seq(struc
static void flush_end_io(struct request *flush_rq, int error)
{
struct request_queue *q = flush_rq->q;
- struct list_head *running = &q->flush_queue[q->flush_running_idx];
+ LIST_HEAD(proceed_list);
bool queued = false;
struct request *rq, *n;

BUG_ON(q->flush_pending_idx == q->flush_running_idx);

+ list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
+ /*
+ * If the queue doesn't support queueable flush requests, we can push the
+ * pending requests to the next stage too. For such a queue, there are no
+ * normal requests running while a flush request is running, so this still
+ * guarantees correctness.
+ */
+ if (!blk_queue_flush_queueable(q))
+ list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
+ &proceed_list);
+
/* account completion of the flush request */
q->flush_running_idx ^= 1;
elv_completed_request(q, flush_rq);

/* and push the waiting requests to the next stage */
- list_for_each_entry_safe(rq, n, running, flush.list) {
+ list_for_each_entry_safe(rq, n, &proceed_list, flush.list) {
unsigned int seq = blk_flush_cur_seq(rq);

BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h 2011-04-19 09:15:15.000000000 +0800
+++ linux/include/linux/blkdev.h 2011-04-19 10:04:46.000000000 +0800
@@ -366,6 +366,7 @@ struct request_queue
* for flush operations
*/
unsigned int flush_flags;
+ unsigned int flush_not_queueable:1;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
unsigned long flush_pending_since;
@@ -552,6 +553,16 @@ static inline void blk_clear_queue_full(
queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
}

+static inline void blk_set_queue_flush_queueable(struct request_queue *q,
+ bool queueable)
+{
+ q->flush_not_queueable = !queueable;
+}
+
+static inline bool blk_queue_flush_queueable(struct request_queue *q)
+{
+ return !q->flush_not_queueable;
+}

/*
* mergeable request must not have _NOMERGE or _BARRIER bit set, nor may


2011-04-22 23:32:13

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hello, Shaohua.

> + list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> + /*
> + * If the queue doesn't support queueable flush requests, we can push the
> + * pending requests to the next stage too. For such a queue, there are no
> + * normal requests running while a flush request is running, so this still
> + * guarantees correctness.
> + */
> + if (!blk_queue_flush_queueable(q))
> + list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> + &proceed_list);

I can't see how this is safe. Request completion is decoupled from
issue. What prevents the low level driver from taking in other requests
before control hits here? And even if that holds for the current
implementation, that's hardly something which can be guaranteed from
!flush_queueable. Am I missing something?

This kind of micro optimization is gonna bring very painful bugs which
are extremely difficult to reproduce and track down. It scares the
hell out of me. It's gonna silently skip flushes where it shouldn't.

If you wanna optimize this case, a much better way would be
implementing back-to-back flush optimization properly such that when
block layer detects two flushes back-to-back and _KNOWS_ that no
request has been issued inbetween, the second one is handled as noop.
Mark the queue clean on flush, dirty on any other request and if the
queue is clean all flushes can be completed immediately on issue which
would also allow us to avoid the whole queue at the front or back
issue without bothering low level drivers at all.
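
(To make the clean/dirty idea concrete, here is a minimal standalone
sketch. Every name in it is invented for illustration; none of this is
actual block layer API:)

/* Hypothetical sketch of the clean/dirty tracking described above. */
#include <stdbool.h>

struct sketch_queue {
	bool cache_dirty;	/* set by any request, cleared by a completed flush */
};

/* call on completion of every request */
static void sketch_complete(struct sketch_queue *q, bool is_flush)
{
	if (is_flush)
		q->cache_dirty = false;	/* drive cache is known clean */
	else
		q->cache_dirty = true;	/* any other request dirties the queue */
}

/* call when a flush is about to be issued: a flush on a clean queue
 * can be completed immediately without ever reaching the drive */
static bool sketch_flush_is_noop(const struct sketch_queue *q)
{
	return !q->cache_dirty;
}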

Thanks.

--
tejun

2011-04-25 01:33:32

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hi,
On Sat, Apr 23, 2011 at 07:32:04AM +0800, Tejun Heo wrote:
> > + list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> > + /*
> > + * If the queue doesn't support queueable flush requests, we can push the
> > + * pending requests to the next stage too. For such a queue, there are no
> > + * normal requests running while a flush request is running, so this still
> > + * guarantees correctness.
> > + */
> > + if (!blk_queue_flush_queueable(q))
> > + list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> > + &proceed_list);
>
> I can't see how this is safe. Request completion is decoupled from
> issue. What prevents the low level driver from taking in other requests
> before control hits here? And even if that holds for the current
> implementation, that's hardly something which can be guaranteed from
> !flush_queueable. Am I missing something?
Say in one fs operation we issue writes r1 and r2, and after they finish,
we issue flush f1. In another operation, we issue writes r3 and r4, and after
they finish, we issue flush f2.
operation 1: r1 r2 f1
operation 2: r3 r4 f2
At the time f1 finishes and f2 is in the queue, we can be sure of two things:
1. r3 and r4 are already finished, otherwise f2 would not be queued.
2. r3 and r4 finished before f1 did. We can only have one request
outstanding on a non-queueable drive, so f1 was dispatched either after r3
and r4 finished or before they finished. Because of item 1, f1 was
dispatched after r3 and r4 finished.
From the two items, when f1 finishes, we can let f2 finish too, because
f1 flushed the disk cache for all requests from r1 to r4.

> This kind of micro optimization is gonna bring very painful bugs which
> are extremely difficult to reproduce and track down. It scares the
> hell out of me. It's gonna silently skip flushes where it shouldn't.
>
> If you wanna optimize this case, a much better way would be
> implementing back-to-back flush optimization properly such that when
> block layer detects two flushes back-to-back and _KNOWS_ that no
> request has been issued inbetween, the second one is handled as noop.
> Mark the queue clean on flush, dirty on any other request and if the
> queue is clean all flushes can be completed immediately on issue which
> would also allow us to avoid the whole queue at the front or back
> issue without bothering low level drivers at all.
If flush is queueable, I'm not sure we can do the optimization. For example,
if we dispatch 32 requests at the same time and the last request is a flush,
can the hardware guarantee the cache for the first 31 requests is flushed out?
On the other hand, my optimization works even if there are write requests in
between the back-to-back flushes.

Thanks,
Shaohua

2011-04-25 08:58:32

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hello,

(cc'ing Darrick)

On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> Say in one fs operation we issue writes r1 and r2, and after they finish,
> we issue flush f1. In another operation, we issue writes r3 and r4, and after
> they finish, we issue flush f2.
> operation 1: r1 r2 f1
> operation 2: r3 r4 f2
> At the time f1 finishes and f2 is in the queue, we can be sure of two things:
> 1. r3 and r4 are already finished, otherwise f2 would not be queued.
> 2. r3 and r4 finished before f1 did. We can only have one request
> outstanding on a non-queueable drive, so f1 was dispatched either after r3
> and r4 finished or before they finished. Because of item 1, f1 was
> dispatched after r3 and r4 finished.
> From the two items, when f1 finishes, we can let f2 finish too, because
> f1 flushed the disk cache for all requests from r1 to r4.

What I was saying is that request completion is decoupled from driver
fetching requests from block layer and that the order of completion
doesn't necessarily follow the order of execution. IOW, nothing
guarantees that FLUSH completion code would run before the low level
driver fetches the next command and _completes_ it, in which case your
code would happily mark flush complete after write without actually
doing it.
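
(One hypothetical interleaving that illustrates the concern; the patch
itself contains nothing that rules it out:)

  f1 completes in hardware, but flush_end_io() hasn't run yet
  the driver fetches write w from the queue; w completes in hardware
  the fs issues flush f2 to cover w; f2 joins the pending list
  flush_end_io() for f1 finally runs and completes f2 as well
  -> f2 is reported done, but w's data was never flushed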

And, in general, I feel uncomfortable with this type of approach, it's
extremely fragile and difficult to understand and verify, and doesn't
match at all with the rest of the code. If you think you can exploit
certain ordering constraint, reflect it into the overall design.
Don't stuff the magic into five line out-of-place code.

> If flush is queueable, I'm not sure we can do the
> optimization. For example, if we dispatch 32 requests at the same
> time and the last request is a flush, can the hardware guarantee
> the cache for the first 31 requests is flushed out? On the other
> hand, my optimization works even if there are write requests in
> between the back-to-back flushes.

Eh, wasn't your optimization only applicable if flush is not
queueable? IIUC, what your optimization achieves is merging
back-to-back flushes and you're achieving that in a _very_ non-obvious
round-about way. Do it in straight-forward way even if that costs
more lines of code.

Darrick, do you see flush performance regression between rc1 and rc2?
You're testing on higher end, so maybe it's still okay for you?

Thanks.

--
tejun

2011-04-25 09:13:18

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hello,

On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> Eh, wasn't your optimization only applicable if flush is not
> queueable? IIUC, what your optimization achieves is merging
> back-to-back flushes and you're achieving that in a _very_ non-obvious
> round-about way. Do it in straight-forward way even if that costs
> more lines of code.

To add a bit more, here, flush exclusivity gives you an extra ordering
constraint that while a flush is in progress no other request can proceed
and thus if there's another flush queued, it can be completed
together, right? If so, teach block layer the whole thing - let block
layer hold further requests while flush is in progress and complete
back-to-back flushes together on completion and then resume normal
queue processing.
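
(Roughly, the sequence being proposed:)

  f1 dispatched -> queue held, the driver is given nothing else
  f2, f3 arrive while f1 runs -> they wait in the pending list
  f1 completes -> no request could have slipped in, so f2 and f3
      are completed together with f1
  queue released -> normal dispatching resumes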

Also, another interesting thing to investigate is why the two flushes
didn't get merged in the first place. The two flushes apparently
didn't have any ordering requirement between them. Why didn't they
get merged in the first place? If the first flush were slightly
delayed, the second would have been issued together from the beginning
and we wouldn't have to think about merging them afterwards. Maybe
what we really need is a better algorithm than the C1/2/3 described in the
comment?

What did sysbench do in the workload which showed the regression? A
lot of parallel fsyncs combined with writes?

Thanks.

--
tejun

2011-04-26 00:42:44

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

On Mon, 2011-04-25 at 16:58 +0800, Tejun Heo wrote:
> Hello,
>
> (cc'ing Darrick)
>
> On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> > Say in one fs operation we issue writes r1 and r2, and after they finish,
> > we issue flush f1. In another operation, we issue writes r3 and r4, and after
> > they finish, we issue flush f2.
> > operation 1: r1 r2 f1
> > operation 2: r3 r4 f2
> > At the time f1 finishes and f2 is in the queue, we can be sure of two things:
> > 1. r3 and r4 are already finished, otherwise f2 would not be queued.
> > 2. r3 and r4 finished before f1 did. We can only have one request
> > outstanding on a non-queueable drive, so f1 was dispatched either after r3
> > and r4 finished or before they finished. Because of item 1, f1 was
> > dispatched after r3 and r4 finished.
> > From the two items, when f1 finishes, we can let f2 finish too, because
> > f1 flushed the disk cache for all requests from r1 to r4.
>
> What I was saying is that request completion is decoupled from driver
> fetching requests from block layer and that the order of completion
> doesn't necessarily follow the order of execution. IOW, nothing
> guarantees that FLUSH completion code would run before the low level
> driver fetches the next command and _completes_ it, in which case your
> code would happily mark flush complete after write without actually
> doing it.
What I described assumes non-queueable flush requests.
For queueable flushes, this definitely isn't correct.

> And, in general, I feel uncomfortable with this type of approach, it's
> extremely fragile and difficult to understand and verify, and doesn't
> match at all with the rest of the code. If you think you can exploit
> certain ordering constraint, reflect it into the overall design.
> Don't stuff the magic into five line out-of-place code.
>
> > If flush is queueable, I'm not sure we can do the
> > optimization. For example, if we dispatch 32 requests at the same
> > time and the last request is a flush, can the hardware guarantee
> > the cache for the first 31 requests is flushed out? On the other
> > hand, my optimization works even if there are write requests in
> > between the back-to-back flushes.
>
> Eh, wasn't your optimization only applicable if flush is not
> queueable? IIUC, what your optimization achieves is merging
> back-to-back flushes and you're achieving that in a _very_ non-obvious
> round-about way. Do it in straight-forward way even if that costs
> more lines of code.
This isn't a problem of more code or less code. I think my patch is
already quite simple.

The method you described only works for non-queueable flushes too, and it
has the limitation that the requests between two back-to-back flushes must
not be writes. My patch works for non-queueable flushes but has no such
limitation.

> Darrick, do you see flush performance regression between rc1 and rc2?
> You're testing on higher end, so maybe it's still okay for you?
Please ignore the regression. The patch isn't related to the regression,
but that problem motivated me to write the patch.
Actually I still need the RFC patch in another thread to fix the
regression. I hope you and Jens can seriously look at that issue too.

Thanks,
Shaohua

2011-04-26 00:46:34

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

On Mon, 2011-04-25 at 17:13 +0800, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> > Eh, wasn't your optimization only applicable if flush is not
> > queueable? IIUC, what your optimization achieves is merging
> > back-to-back flushes and you're achieving that in a _very_ non-obvious
> > round-about way. Do it in straight-forward way even if that costs
> > more lines of code.
>
> To add a bit more, here, flush exclusivity gives you an extra ordering
> constraint that while a flush is in progress no other request can proceed
> and thus if there's another flush queued, it can be completed
> together, right? If so, teach block layer the whole thing - let block
> layer hold further requests while flush is in progress and complete
> back-to-back flushes together on completion and then resume normal
> queue processing.
blk-flush is part of the block layer. If what you mean is the libata
part, the block layer doesn't know whether flush is queueable without
knowledge from the driver.

> Also, another interesting thing to investigate is why the two flushes
> didn't get merged in the first place. The two flushes apparently
> didn't have any ordering requirement between them. Why didn't they
> get merged in the first place? If the first flush were slightly
> delayed, the second would have been issued together from the beginning
> and we wouldn't have to think about merging them afterwards. Maybe
> what we really need is a better algorithm than the C1/2/3 described in the
> comment?
The sysbench fileio test does a 16-thread write-fsync, so it's quite normal
that a flush is running while another flush is added to the pending list.

Thanks,
Shaohua

2011-04-26 10:40:24

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hey,

On Tue, Apr 26, 2011 at 08:42:39AM +0800, Shaohua Li wrote:
> > What I was saying is that request completion is decoupled from driver
> > fetching requests from block layer and that the order of completion
> > doesn't necessarily follow the order of execution. IOW, nothing
> > guarantees that FLUSH completion code would run before the low level
> > driver fetches the next command and _completes_ it, in which case your
> > code would happily mark flush complete after write without actually
> > doing it.
>
> What I described assumes non-queueable flush requests.
> For queueable flushes, this definitely isn't correct.

We're definitely having communication issues. The above doesn't have
anything to do with queueability of flushes. It's about the
asynchronous nature of block request completion and issue paths, so it
can happen whether flush is queueable or not, or am I still
misunderstanding you?

> > Eh, wasn't your optimization only applicable if flush is not
> > queueable? IIUC, what your optimization achieves is merging
> > back-to-back flushes and you're achieving that in a _very_ non-obvious
> > round-about way. Do it in straight-forward way even if that costs
> > more lines of code.
>
> This isn't a problem of more code or less code. I think my patch is
> already quite simple.

Well, then, we'll have to agree to disagree there as it looks really
hackish to me and I don't think it's even correct as written above.

> The method you described only works for non-queueable flushes too, and it
> has the limitation that the requests between two back-to-back flushes must
> not be writes. My patch works for non-queueable flushes but has no such
> limitation.

No, I'm saying you can achieve about the same effect in a cleaner and
safer way if you teach the issue and completion paths properly about
these back-to-back flushes at the cost of more code changes. Your
patch doesn't work reliably whether flush is queueable or not.

> > Darrick, do you see flush performance regression between rc1 and rc2?
> > You're testing on higher end, so maybe it's still okay for you?
>
> Please ignore the regression. The patch isn't related to the regression,
> but that problem motivated me to write the patch.
> Actually I still need the RFC patch in another thread to fix the
> regression. I hope you and Jens can seriously look at that issue too.

Ah, okay, it's a separate issue. Sorry about confusing the two.
I'll continue in another reply.

Thanks.

--
tejun

2011-04-26 10:48:50

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hey,

On Tue, Apr 26, 2011 at 08:46:30AM +0800, Shaohua Li wrote:
> blk-flush is part of the block layer. If what you mean is the libata
> part, the block layer doesn't know whether flush is queueable without
> knowledge from the driver.

It seems my communication skill towards you sucks badly. I was
talking about making both the issue and completion paths cooperate on
flush sequence handling instead of depending on queueability of flush
to make assumptions on completion order, which I still think is
incorrect.

> > Also, another interesting thing to investigate is why the two flushes
> > didn't get merged in the first place. The two flushes apparently
> > didn't have any ordering requirement between them. Why didn't they
> > get merged in the first place? If the first flush were slightly
> > delayed, the second would have been issued together from the beginning
> > and we wouldn't have to think about merging them afterwards. Maybe
> > what we really need is a better algorithm than the C1/2/3 described in the
> > comment?
>
> The sysbench fileio test does a 16-thread write-fsync, so it's quite normal
> that a flush is running while another flush is added to the pending list.

I think you're optimizing in the wrong place. These back-to-back
flushes better be merged on the issue side instead of completion. The
current merging rules (basically how long to delay pending flushes)
are minimal and mechanical (timeout is the only heuristic parameter).

For the initial implementation, my goal was matching the numbers of
Darrick's original implementation at higher level and keeping things
simple, but the code is intentionally structured to allow easy tuning
of merging criteria, so I suggest looking there instead of mucking
with completion path.

Thank you.

--
tejun

2011-04-28 07:51:00

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

On Tue, 2011-04-26 at 18:48 +0800, Tejun Heo wrote:
> Hey,
>
> On Tue, Apr 26, 2011 at 08:46:30AM +0800, Shaohua Li wrote:
> > blk-flush is part of the block layer. If what you mean is the libata
> > part, the block layer doesn't know whether flush is queueable without
> > knowledge from the driver.
>
> It seems my communication skill towards you sucks badly. I was
> talking about making both the issue and completion paths cooperate on
> flush sequence handling instead of depending on queueability of flush
> to make assumptions on completion order, which I still think is
> incorrect.
>
> > > Also, another interesting thing to investigate is why the two flushes
> > > didn't get merged in the first place. The two flushes apparently
> > > didn't have any ordering requirement between them. Why didn't they
> > > get merged in the first place? If the first flush were slightly
> > > delayed, the second would have been issued together from the beginning
> > > and we wouldn't have to think about merging them afterwards. Maybe
> > > what we really need is a better algorithm than the C1/2/3 described in the
> > > comment?
> >
> > The sysbench fileio test does a 16-thread write-fsync, so it's quite normal
> > that a flush is running while another flush is added to the pending list.
>
> I think you're optimizing in the wrong place. These back-to-back
> flushes better be merged on the issue side instead of completion. The
> current merging rules (basically how long to delay pending flushes)
> are minimal and mechanical (timeout is the only heuristic parameter).
>
> For the initial implementation, my goal was matching the numbers of
> Darrick's original implementation at higher level and keeping things
> simple, but the code is intentionally structured to allow easy tuning
> of merging criteria, so I suggest looking there instead of mucking
> with completion path.
I got your point now. Thanks for the explanation. You are right, we should
postpone the normal requests to make the optimization safe. Also I merged
the back-to-back flushes in the issue stage. Below is the updated patch; does
it make sense to you?

Subject: block: optimize non-queueable flush request drive

In some drives, flush requests are non-queueable: while a flush request is
running, normal read/write requests are not. On such drives, when the running
flush request finishes, we can complete the pending flush requests as well.
Since no normal write requests ran in the meantime, the running flush has
already flushed out the drive cache that the pending flushes require. This
eliminates some flush executions and improves performance.

Tejun pointed out we should hold normal requests while a flush is running in
this case; otherwise the low level driver might fetch the next normal request
and finish it before reporting the flush request's completion. He is right and
the patch follows his suggestions. Holding normal requests here doesn't harm
performance, because the low level driver would requeue them anyway even if we
didn't hold them, and the holding avoids some unnecessary requeue operations.

This patch lets the block core use the optimization. The next patch will
enable it for SATA.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-core.c | 14 ++++++++++++++
block/blk-flush.c | 16 ++++++++++++++++
include/linux/blkdev.h | 17 +++++++++++++++++
3 files changed, 47 insertions(+)

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c 2011-04-28 10:23:12.000000000 +0800
+++ linux/block/blk-flush.c 2011-04-28 14:12:50.000000000 +0800
@@ -158,6 +158,17 @@ static bool blk_flush_complete_seq(struc
switch (seq) {
case REQ_FSEQ_PREFLUSH:
case REQ_FSEQ_POSTFLUSH:
+ /*
+ * If the queue doesn't support queueable flush requests, we just
+ * merge the flush with the running flush. For such a queue, there
+ * are no normal requests running while a flush request is
+ * running, so this still guarantees correctness.
+ */
+ if (!blk_queue_flush_queueable(q)) {
+ list_move_tail(&rq->flush.list,
+ &q->flush_queue[q->flush_running_idx]);
+ break;
+ }
/* queue for flush */
if (list_empty(pending))
q->flush_pending_since = jiffies;
@@ -199,6 +210,10 @@ static void flush_end_io(struct request

BUG_ON(q->flush_pending_idx == q->flush_running_idx);

+ q->flush_exclusive_running = 0;
+ queued |= q->flush_queue_delayed;
+ q->flush_queue_delayed = 0;
+
/* account completion of the flush request */
q->flush_running_idx ^= 1;
elv_completed_request(q, flush_rq);
@@ -343,6 +358,7 @@ void blk_abort_flushes(struct request_qu
struct request *rq, *n;
int i;

+ q->flush_exclusive_running = 0;
/*
* Requests in flight for data are already owned by the dispatch
* queue or the device driver. Just restore for normal completion.
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h 2011-04-28 10:23:12.000000000 +0800
+++ linux/include/linux/blkdev.h 2011-04-28 10:32:54.000000000 +0800
@@ -364,6 +364,13 @@ struct request_queue
* for flush operations
*/
unsigned int flush_flags;
+ unsigned int flush_not_queueable:1;
+ /*
+ * flush_exclusive_running and flush_queue_delayed are only meaningful
+ * when flush request isn't queueable
+ */
+ unsigned int flush_exclusive_running:1;
+ unsigned int flush_queue_delayed:1;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
unsigned long flush_pending_since;
@@ -549,6 +556,16 @@ static inline void blk_clear_queue_full(
queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
}

+static inline void blk_set_queue_flush_queueable(struct request_queue *q,
+ bool queueable)
+{
+ q->flush_not_queueable = !queueable;
+}
+
+static inline bool blk_queue_flush_queueable(struct request_queue *q)
+{
+ return !q->flush_not_queueable;
+}

/*
* mergeable request must not have _NOMERGE or _BARRIER bit set, nor may
Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c 2011-04-28 10:23:12.000000000 +0800
+++ linux/block/blk-core.c 2011-04-28 10:53:18.000000000 +0800
@@ -929,6 +929,8 @@ void blk_requeue_request(struct request_

BUG_ON(blk_queued_rq(rq));

+ if (rq == &q->flush_rq)
+ q->flush_exclusive_running = 0;
elv_requeue_request(q, rq);
}
EXPORT_SYMBOL(blk_requeue_request);
@@ -1847,6 +1849,15 @@ struct request *blk_peek_request(struct
int ret;

while ((rq = __elv_next_request(q)) != NULL) {
+ /*
+ * If flush isn't queueable and is running, we delay normal
+ * requests. Normal requests will get requeued even if we don't delay
+ * them, and this gives us a chance to batch flush requests.
+ */
+ if (q->flush_exclusive_running) {
+ q->flush_queue_delayed = 1;
+ return NULL;
+ }
if (!(rq->cmd_flags & REQ_STARTED)) {
/*
* This is the first time the device driver
@@ -1961,6 +1972,7 @@ void blk_dequeue_request(struct request
*/
void blk_start_request(struct request *req)
{
+ struct request_queue *q = req->q;
blk_dequeue_request(req);

/*
@@ -1972,6 +1984,8 @@ void blk_start_request(struct request *r
req->next_rq->resid_len = blk_rq_bytes(req->next_rq);

blk_add_timer(req);
+ if (req == &q->flush_rq && !blk_queue_flush_queueable(q))
+ q->flush_exclusive_running = 1;
}
EXPORT_SYMBOL(blk_start_request);


2011-04-30 14:38:34

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hello, Shaohua.

On Thu, Apr 28, 2011 at 03:50:55PM +0800, Shaohua Li wrote:
> Index: linux/block/blk-flush.c
> ===================================================================
> --- linux.orig/block/blk-flush.c 2011-04-28 10:23:12.000000000 +0800
> +++ linux/block/blk-flush.c 2011-04-28 14:12:50.000000000 +0800
> @@ -158,6 +158,17 @@ static bool blk_flush_complete_seq(struc
> switch (seq) {
> case REQ_FSEQ_PREFLUSH:
> case REQ_FSEQ_POSTFLUSH:
> + /*
> + * If the queue doesn't support queueable flush requests, we just
> + * merge the flush with the running flush. For such a queue, there
> + * are no normal requests running while a flush request is
> + * running, so this still guarantees correctness.
> + */
> + if (!blk_queue_flush_queueable(q)) {
> + list_move_tail(&rq->flush.list,
> + &q->flush_queue[q->flush_running_idx]);
> + break;
> + }

As I've said several times already, I really don't like this magic
being done in the completion path. Can't you detect the condition on
issue of the second/following flush and append it to the running list?
If you already have tried that but this way still seems better, can
you please explain why?

Also, this is a separate logic. Please put it in a separate patch.
The first patch should implement queue holding while flushing, which
should remove the regression, right?

The second patch can optimize back-to-back execution, which might or
might not buy us tangible performance gain, so it would be nice to
have some measurement for this change. Also, this logic isn't
necessarily related with queueability of flushes, right? As such, I
think it would be better for it to be implemented separately from the
queueability thing, unless doing such increases complexity too much.

> Index: linux/include/linux/blkdev.h
> ===================================================================
> --- linux.orig/include/linux/blkdev.h 2011-04-28 10:23:12.000000000 +0800
> +++ linux/include/linux/blkdev.h 2011-04-28 10:32:54.000000000 +0800
> @@ -364,6 +364,13 @@ struct request_queue
> * for flush operations
> */
> unsigned int flush_flags;
> + unsigned int flush_not_queueable:1;
> + /*
> + * flush_exclusive_running and flush_queue_delayed are only meaningful
> + * when flush request isn't queueable
> + */
> + unsigned int flush_exclusive_running:1;
> + unsigned int flush_queue_delayed:1;

Hmmm... why do you need separate ->flush_exclusive_running? Doesn't
pending_idx != running_idx already have the same information?

> Index: linux/block/blk-core.c
> ===================================================================
> --- linux.orig/block/blk-core.c 2011-04-28 10:23:12.000000000 +0800
> +++ linux/block/blk-core.c 2011-04-28 10:53:18.000000000 +0800
> @@ -929,6 +929,8 @@ void blk_requeue_request(struct request_
>
> BUG_ON(blk_queued_rq(rq));
>
> + if (rq == &q->flush_rq)
> + q->flush_exclusive_running = 0;

I don't get this either. What's requeueing got to do with it? It
doesn't matter whether flush gets requeued or not. Once flush is put
onto the dispatch list, isn't holding the queue until it's complete
enough?

Thanks.

--
tejun

2011-05-03 06:44:36

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hi,
> On Thu, Apr 28, 2011 at 03:50:55PM +0800, Shaohua Li wrote:
> > Index: linux/block/blk-flush.c
> > ===================================================================
> > --- linux.orig/block/blk-flush.c 2011-04-28 10:23:12.000000000 +0800
> > +++ linux/block/blk-flush.c 2011-04-28 14:12:50.000000000 +0800
> > @@ -158,6 +158,17 @@ static bool blk_flush_complete_seq(struc
> > switch (seq) {
> > case REQ_FSEQ_PREFLUSH:
> > case REQ_FSEQ_POSTFLUSH:
> > + /*
> > + * If the queue doesn't support queueable flush requests, we just
> > + * merge the flush with the running flush. For such a queue, there
> > + * are no normal requests running while a flush request is
> > + * running, so this still guarantees correctness.
> > + */
> > + if (!blk_queue_flush_queueable(q)) {
> > + list_move_tail(&rq->flush.list,
> > + &q->flush_queue[q->flush_running_idx]);
> > + break;
> > + }
>
> As I've said several times already, I really don't like this magic
> being done in the completion path. Can't you detect the condition on
> issue of the second/following flush and append it to the running list?
Hmm, I don't understand. blk_flush_complete_seq is called when the
second flush is issued. Or do you mean doing this when the second flush is
issued to the disk? But by the time the second flush is issued, the first
flush is already finished.

> If you already have tried that but this way still seems better, can
> you please explain why?
>
> Also, this is a separate logic. Please put it in a separate patch.
> The first patch should implement queue holding while flushing, which
> should remove the regression, right?
OK. Holding the queue has no performance gain in my test, but it reduced a
lot of request requeues.

> The second patch can optimize back-to-back execution, which might or
> might not buy us tangible performance gain, so it would be nice to
> have some measurement for this change. Also, this logic isn't
> necessarily related with queueability of flushes, right? As such, I
> think it would be better for it to be implemented separately from the
> queueability thing, unless doing such increases complexity too much.
>
> > Index: linux/include/linux/blkdev.h
> > ===================================================================
> > --- linux.orig/include/linux/blkdev.h 2011-04-28 10:23:12.000000000 +0800
> > +++ linux/include/linux/blkdev.h 2011-04-28 10:32:54.000000000 +0800
> > @@ -364,6 +364,13 @@ struct request_queue
> > * for flush operations
> > */
> > unsigned int flush_flags;
> > + unsigned int flush_not_queueable:1;
> > + /*
> > + * flush_exclusive_running and flush_queue_delayed are only meaningful
> > + * when flush request isn't queueable
> > + */
> > + unsigned int flush_exclusive_running:1;
> > + unsigned int flush_queue_delayed:1;
>
> Hmmm... why do you need separate ->flush_exclusive_running? Doesn't
> pending_idx != running_idx already have the same information?
When pending_idx != running_idx, the flush request has been added to the queue
tail, but this doesn't mean the flush request has been dispatched to the disk.
There might be other requests at the queue head, which we should dispatch. And
the flush request might be requeued. Just checking pending_idx != running_idx
would cause a queue hang: we would think the flush had been dispatched and
hold the queue, but the flush actually isn't dispatched yet and the queue
should dispatch other normal requests.
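
(The hang spelled out, assuming the hold were keyed off the index test:)

  flush f is queued -> pending_idx != running_idx becomes true
  blk_peek_request() sees the test true and returns NULL,
      before dispatching anything at all
  f itself (and any requests ahead of it) never reach the drive
  f never completes, the indices never equalize -> the queue hangs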

Thanks,
Shaohua

2011-05-03 08:23:32

by Tejun Heo

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

Hello,

On Tue, May 03, 2011 at 02:44:31PM +0800, Shaohua Li wrote:
> > As I've said several times already, I really don't like this magic
> > being done in the completion path. Can't you detect the condition on
> > issue of the second/following flush and append it to the running list?
>
> Hmm, I don't understand. blk_flush_complete_seq is called when the
> second flush is issued. Or do you mean doing this when the second flush is
> issued to the disk? But by the time the second flush is issued, the first
> flush is already finished.

Ah, okay, my bad. That's the next sequence logic, so the right place.
Still, please do the following.

* Put it in a separate patch.

* Preferably, detect the actual condition (back to back flush) rather
than the queueability test unless it's too complicated.

* Please make pending/running paths look more symmetrical.

> > If you already have tried that but this way still seems better, can
> > you please explain why?
> >
> > Also, this is a separate logic. Please put it in a separate patch.
> > The first patch should implement queue holding while flushing, which
> > should remove the regression, right?
>
> OK. Holding the queue has no performance gain in my test, but it reduced a
> lot of request requeues.

No, holding the queue should remove the regression completely. Please
read on.

> > Hmmm... why do you need separate ->flush_exclusive_running? Doesn't
> > pending_idx != running_idx already have the same information?
>
> when pending_idx != running_idx, flush request is added into queue tail,
> but this doesn't mean flush request is dispatched to disk. there might
> be other requests in the queue head, which we should dispatch. And flush
> request might be reqeueud. Just checking pending_idx != running_idx will
> cause queue hang because we thought flush is dispatched and then hold
> the queue, but actually flush isn't dispatched yet, the queue should
> dispatch other normal requests.

Don't hold elv_next_request(). Hold ->elevator_dispatch_fn().
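
(For illustration, a sketch of what holding at the elevator-refill level
could look like, modeled loosely on __elv_next_request(); the exact guard
and its placement are an assumption here, not code from this thread:)

static inline struct request *sketch_elv_next_request(struct request_queue *q)
{
	struct request *rq;

	while (1) {
		/* anything already on the dispatch list, including a
		 * dispatched flush, still goes out to the driver */
		if (!list_empty(&q->queue_head)) {
			rq = list_entry_rq(q->queue_head.next);
			return rq;
		}

		/* a flush is in flight and flushes aren't queueable:
		 * hold only the refill from the elevator */
		if (q->flush_pending_idx != q->flush_running_idx &&
		    !blk_queue_flush_queueable(q)) {
			q->flush_queue_delayed = 1;
			return NULL;
		}

		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
			return NULL;
	}
}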

Thanks.

--
tejun

2011-05-04 06:20:19

by Shaohua Li

Subject: Re: [PATCH 1/2]block: optimize non-queueable flush request drive

On Tue, 2011-05-03 at 16:23 +0800, Tejun Heo wrote:
> Hello,
>
> On Tue, May 03, 2011 at 02:44:31PM +0800, Shaohua Li wrote:
> > > As I've said several times already, I really don't like this magic
> > > being done in the completion path. Can't you detect the condition on
> > > issue of the second/following flush and append it to the running list?
> >
> > Hmm, I don't understand. blk_flush_complete_seq is called when the
> > second flush is issued. Or do you mean doing this when the second flush is
> > issued to the disk? But by the time the second flush is issued, the first
> > flush is already finished.
>
> Ah, okay, my bad. That's the next sequence logic, so the right place.
> Still, please do the following.
>
> * Put it in a separate patch.
>
> * Preferably, detect the actual condition (back to back flush) rather
> than the queueability test unless it's too complicated.
>
> * Please make pending/running paths look more symmetrical.
I retested, and it appears just holding the queue is already good enough. After
holding the queue, merging back-to-back flushes doesn't bring much benefit, so
I won't pursue the back-to-back merge. I'll post my latest patches out soon.

> > > If you already have tried that but this way still seems better, can
> > > you please explain why?
> > >
> > > Also, this is a separate logic. Please put it in a separate patch.
> > > The first patch should implement queue holding while flushing, which
> > > should remove the regression, right?
> >
> > OK. Holding the queue has no performance gain in my test, but it reduced a
> > lot of request requeues.
>
> No, holding the queue should remove the regression completely. Please
> read on.
>
> > > Hmmm... why do you need separate ->flush_exclusive_running? Doesn't
> > > pending_idx != running_idx already have the same information?
> >
> > When pending_idx != running_idx, the flush request has been added to the queue
> > tail, but this doesn't mean the flush request has been dispatched to the disk.
> > There might be other requests at the queue head, which we should dispatch. And
> > the flush request might be requeued. Just checking pending_idx != running_idx
> > would cause a queue hang: we would think the flush had been dispatched and
> > hold the queue, but the flush actually isn't dispatched yet and the queue
> > should dispatch other normal requests.
>
> Don't hold elv_next_request(). Hold ->elevator_dispatch_fn().
OK, this works.

Thanks,
Shaohua