Alex found a regression when running the sysbench fileio test on an ext4
filesystem on a hard disk. The hard disk is attached to an AHCI controller.
The regression is about 15%. He bisected it to 53d63e6b0dfb95882e.
At first glance it's quite strange that the commit can cause any difference,
since q->queue_head usually has just one entry. It looks like on SATA normal
requests and flush requests are exclusive, which causes a lot of requests to be
requeued. From the log, a flush finishes and is then followed by two requests,
one a normal request and the other a flush request. If we let the flush run
first, we dispatch a flush just after a flush finishes. The second flush should
finish quickly, since the disk cache has already been flushed, at least for the
most part. This also delays the normal request, so we potentially do more
normal requests before a flush.
Changing the order here should not affect correctness, because the filesystem
should already have waited for the required normal requests to finish.
The patch below recovers the regression. We don't change the order if the
just-finished request isn't a flush request, so in that case the flush is still
delayed as before.
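To make the ordering concrete, here is a toy userspace simulation of the
dispatch list (the request names f1, w1 and f2 are only illustrative, not taken
from the actual log): f1 is the flush that just finished, w1 a normal write
already sitting on q->queue_head, and f2 the next flush. Tail insertion gives
f1, w1, f2; head insertion gives f1, f2, w1.

#include <stdio.h>
#include <string.h>

#define MAXQ 8

struct queue {
	const char *req[MAXQ];
	int len;
};

/* list_add_tail() analogue: append at the back of the dispatch list */
static void add_tail(struct queue *q, const char *req)
{
	q->req[q->len++] = req;
}

/* list_add() analogue: insert at the front of the dispatch list */
static void add_head(struct queue *q, const char *req)
{
	memmove(&q->req[1], &q->req[0], q->len * sizeof(q->req[0]));
	q->req[0] = req;
	q->len++;
}

static void show(const char *label, const struct queue *q)
{
	int i;

	printf("%s: f1", label);	/* f1 has just completed */
	for (i = 0; i < q->len; i++)
		printf(" -> %s", q->req[i]);
	printf("\n");
}

int main(void)
{
	/* a normal write w1 is already waiting on the dispatch list */
	struct queue tail_q = { { "w1" }, 1 };
	struct queue head_q = { { "w1" }, 1 };

	add_tail(&tail_q, "f2");	/* current behaviour */
	add_head(&head_q, "f2");	/* patched behaviour after a finished flush */

	show("tail insert (current)", &tail_q);	/* f1 -> w1 -> f2 */
	show("head insert (patched)", &head_q);	/* f1 -> f2 -> w1 */

	return 0;
}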
BTW, for SATA-like controllers we could do an optimization. When the running
list of q->flush_queue proceeds, we could advance the pending list too (that is,
the two lists could be merged), because normal requests and flush requests are
exclusive: when a flush request is running, no normal request should be running.
I don't know whether this is worthwhile; if it is, I can work on it.
Reported-by: Alex Shi <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
diff --git a/block/blk-flush.c b/block/blk-flush.c
index eba4a27..78a88d7 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -89,7 +89,7 @@ enum {
FLUSH_PENDING_TIMEOUT = 5 * HZ,
};
-static bool blk_kick_flush(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q, bool flush_end);
static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
@@ -141,7 +141,7 @@ static void blk_flush_restore_request(struct request *rq)
* %true if requests were added to the dispatch queue, %false otherwise.
*/
static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
- int error)
+ int error, bool flush_end)
{
struct request_queue *q = rq->q;
struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
@@ -187,7 +187,7 @@ static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
BUG();
}
- return blk_kick_flush(q) | queued;
+ return blk_kick_flush(q, flush_end) | queued;
}
static void flush_end_io(struct request *flush_rq, int error)
@@ -208,7 +208,7 @@ static void flush_end_io(struct request *flush_rq, int error)
unsigned int seq = blk_flush_cur_seq(rq);
BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
- queued |= blk_flush_complete_seq(rq, seq, error);
+ queued |= blk_flush_complete_seq(rq, seq, error, true);
}
/*
@@ -234,7 +234,7 @@ static void flush_end_io(struct request *flush_rq, int error)
* RETURNS:
* %true if flush was issued, %false otherwise.
*/
-static bool blk_kick_flush(struct request_queue *q)
+static bool blk_kick_flush(struct request_queue *q, bool flush_end)
{
struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
struct request *first_rq =
@@ -261,7 +261,10 @@ static bool blk_kick_flush(struct request_queue *q)
q->flush_rq.end_io = flush_end_io;
q->flush_pending_idx ^= 1;
- list_add_tail(&q->flush_rq.queuelist, &q->queue_head);
+ if (flush_end)
+ list_add(&q->flush_rq.queuelist, &q->queue_head);
+ else
+ list_add_tail(&q->flush_rq.queuelist, &q->queue_head);
return true;
}
@@ -273,7 +276,7 @@ static void flush_data_end_io(struct request *rq, int error)
* After populating an empty queue, kick it to avoid stall. Read
* the comment in flush_end_io().
*/
- if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error))
+ if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error, false))
__blk_run_queue(q, true);
}
@@ -325,7 +328,7 @@ void blk_insert_flush(struct request *rq)
rq->cmd_flags |= REQ_FLUSH_SEQ;
rq->end_io = flush_data_end_io;
- blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
+ blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0, false);
}
/**
On 2011-04-18 09:36, Shaohua Li wrote:
> Alex found a regression when running the sysbench fileio test on an ext4
> filesystem on a hard disk. The hard disk is attached to an AHCI
> controller. The regression is about 15%. He bisected it to
> 53d63e6b0dfb95882e. At first glance it's quite strange that the commit
> can cause any difference, since q->queue_head usually has just one
> entry. It looks like on SATA normal requests and flush requests are
> exclusive, which causes a lot of requests to be requeued. From the log, a
A flush isn't queueable for SATA, NCQ essentially only really allows
READs and WRITEs to be queued.
> flush finishes and is then followed by two requests, one a normal
> request and the other a flush request. If we let the flush run first, we
> dispatch a flush just after a flush finishes. The second flush should
> finish quickly, since the disk cache has already been flushed, at least
> for the most part. This also delays the normal request, so we potentially
> do more normal requests before a flush. Changing the order here should
> not affect correctness, because the filesystem should already have waited
> for the required normal requests to finish. The patch below recovers the
> regression. We don't change the order if the just-finished request isn't
> a flush request, so in that case the flush is still delayed as before.
It's not about correctness, it's actually a safety concern. True head
additions should be reserved for internal operations, things like error
recovery, spinning a disk up for service, power management, etc.
Imagine a driver needing some special operation done before it can do
IO, and that operation is queued at the head. If the flush happens
immediately after and skips to the head, you could get into trouble.
I think we can do this safely if you check what the head request is - if
it's a regular read/write request, then it should be safe to head
insert. That is safe IFF we always wait on requests when they are
ordered, which we do now. But it is a bit ugly...
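For illustration, a rough and untested sketch of that check (the helper name is
hypothetical, and the request-type test is just the ordinary fs-request check
from that era):

static bool flush_head_insert_ok(struct request_queue *q)
{
	struct request *head_rq;

	/* empty dispatch list: nothing to bypass, head insert is fine */
	if (list_empty(&q->queue_head))
		return true;

	head_rq = list_first_entry(&q->queue_head, struct request, queuelist);

	/* only jump ahead of ordinary fs reads/writes, never special requests */
	return head_rq->cmd_type == REQ_TYPE_FS;
}

blk_kick_flush() would then do the front insertion only when the just-finished
request was a flush and this check passes, and fall back to list_add_tail()
otherwise.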
> BTW, for SATA-like controllers we could do an optimization. When the
> running list of q->flush_queue proceeds, we could advance the pending
> list too (that is, the two lists could be merged), because normal
> requests and flush requests are exclusive: when a flush request is
> running, no normal request should be running. I don't know whether this
> is worthwhile; if it is, I can work on it.
Might be worth adding something for this special case, seems like the
NCQ restrictions will continue to be around forever (or a long time, at
least).
--
Jens Axboe
On Mon, 2011-04-18 at 16:08 +0800, Jens Axboe wrote:
> On 2011-04-18 09:36, Shaohua Li wrote:
> > Alex found a regression when running the sysbench fileio test on an ext4
> > filesystem on a hard disk. The hard disk is attached to an AHCI
> > controller. The regression is about 15%. He bisected it to
> > 53d63e6b0dfb95882e. At first glance it's quite strange that the commit
> > can cause any difference, since q->queue_head usually has just one
> > entry. It looks like on SATA normal requests and flush requests are
> > exclusive, which causes a lot of requests to be requeued. From the log, a
>
> A flush isn't queueable for SATA, NCQ essentially only really allows
> READs and WRITEs to be queued.
>
> > flush finishes and is then followed by two requests, one a normal
> > request and the other a flush request. If we let the flush run first, we
> > dispatch a flush just after a flush finishes. The second flush should
> > finish quickly, since the disk cache has already been flushed, at least
> > for the most part. This also delays the normal request, so we potentially
> > do more normal requests before a flush. Changing the order here should
> > not affect correctness, because the filesystem should already have waited
> > for the required normal requests to finish. The patch below recovers the
> > regression. We don't change the order if the just-finished request isn't
> > a flush request, so in that case the flush is still delayed as before.
>
> It's not about correctness, it's actually a safety concern. True head
> additions should be reserved for internal operations, things like error
> recovery, spinning a disk up for service, power management, etc.
> Imagine a driver needing some special operation done before it can do
> IO, and that operation is queued at the head. If the flush happens
> immediately after and skips to the head, you could get into trouble.
Then why does requeue add requests at the head? We could have a similar issue.
> I think we can do this safely if you check what the head request is - if
> it's a regular read/write request, then it should be safe to head
> insert. That is safe IFF we always wait on requests when they are
> ordered, which we do now. But it is a bit ugly...
hmm, don't want to do this...
> > BTW, for SATA-like controllers we could do an optimization. When the
> > running list of q->flush_queue proceeds, we could advance the pending
> > list too (that is, the two lists could be merged), because normal
> > requests and flush requests are exclusive: when a flush request is
> > running, no normal request should be running. I don't know whether this
> > is worthwhile; if it is, I can work on it.
>
> Might be worth adding something for this special case, seems like the
> NCQ restrictions will continue to be around forever (or a long time, at
> least).
I'll look at this. Optimizing this should fix the regression too. On the
other hand, adding the flush request at the head when it immediately
follows a flush still has an advantage, because the drive cache has
already been flushed.
Thanks,
Shaohua
On Mon, Apr 18, 2011 at 10:08:52AM +0200, Jens Axboe wrote:
> Might be worth adding something for this special case, seems like the
> NCQ restrictions will continue to be around forever (or a long time, at
> least).
I heard people are working on adding a queued FLUSH to the standard,
but it's going to take a long time for it to get into real-life systems.
What would help now is allowing libata to actually use the FUA bit,
given that every common disk and controller supports it these days.
Shaohua, does adding
libata.fua=1
to the kernel command line help your benchmark in any way? It should
if your flushes are mostly from journal writes, but not if they are from
fsync calls that didn't change any metadata.
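To make concrete what using the FUA bit buys: a commit-record style write
tagged REQ_FUA is durable when it completes, so no separate post-flush has to
follow it. A minimal, illustrative sketch of such a submission (this is not
jbd2's actual code; the function and parameter names are made up, and the bio
API shown is the 2.6.39-era one):

#include <linux/fs.h>
#include <linux/bio.h>

static void submit_commit_record(struct block_device *bdev, struct page *page,
				 sector_t sector, bio_end_io_t *end_io)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	/* preflush covers earlier writes, FUA makes the record itself durable */
	submit_bio(WRITE_FLUSH_FUA, bio);
}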
On Mon, 2011-04-18 at 17:26 +0800, Christoph Hellwig wrote:
> On Mon, Apr 18, 2011 at 10:08:52AM +0200, Jens Axboe wrote:
> > Might be worth adding something for this special case, seems like the
> > NCQ restrictions will continue to be around forever (or a long time, at
> > least).
>
> I heard people are working on adding a queued FLUSH to the standard,
> but it's going to take a long time for it to get into real-life systems.
>
> What would help now is allowing libata to actually use the FUA bit,
> given that every common disk and controller supports it these days.
>
> Shaohua, does adding
>
> libata.fua=1
>
> to the kernel command line help your benchmark in any way? It should
> if your flushes are mostly from journal writes, but not if they are from
> fsync calls that didn't change any metadata.
This is a workload with fsync. I tested libata.fua=1, but nothing changed.
I also hacked the code so that when we advance the queue's running flush list
we also advance the pending flush list, which doesn't change correctness for
SATA. This improved things a little, around 5%, but doesn't recover the whole
regression, so I still need to add the flush request at the queue head.
Thanks,
Shaohua
Hello,
On Mon, Apr 18, 2011 at 04:25:57PM +0800, Shaohua Li wrote:
> Then why does requeue add requests at the head? We could have a similar issue.
SCSI doesn't seem to do it anymore, but it used to cache the scmd at
rq->special over requeues so that it doesn't have to re-initialize
requests across requeues, which means that an unprepped request getting
ahead of requeued ones may lead to deadlock due to resource
starvation; that's why requeue uses front queueing.
The code changed over time and the above requirement might not be
necessary at this point. I don't know. However, the block layer doesn't
have any way to enforce that requests can't hold extra resources
on requeue, and having such a difficult-to-trigger deadlock condition
lying dormant is scary.
What kind of benchmarking are we talking about on which kernel?
blk-flush support has been revamped twice recently. 2.6.38 stripped
out the block layer barrier thing and then it got re-reimplemented for
2.6.39 to support advanced flush merging. If the regression (for
which benchmark btw?) was visible on the older reimplementation, I'd
really like to know how it behaves on 2.6.39-rcX.
If the problem is localized to 2.6.38, oh well, too bad, but I don't
think we care too much. If some distro is basing their kernel on
2.6.38 and the flush regression is hurting them, backporting the new
implementation from 2.6.39 shouldn't be too difficult after all. The
reimplementation was almost self-contained.
If the regression affects the 2.6.39 implementation too, eh well, we need
to think of something, but I'd really like to know what kind of
workload we're talking about.
> I'll look at this. Optimizing this should fix the regression too. On the
> other hand, adding the flush request at the head when it immediately
> follows a flush still has an advantage, because the drive cache has
> already been flushed.
The new implementation wouldn't issue two flushes back to back like that;
it doesn't make any sense to begin with. Again, what have you been
testing and how?
Thanks.
--
tejun
On Sat, 2011-04-23 at 06:57 +0800, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 18, 2011 at 04:25:57PM +0800, Shaohua Li wrote:
> > Then why does requeue add requests at the head? We could have a similar issue.
>
> SCSI doesn't seem to do it anymore, but it used to cache the scmd at
> rq->special over requeues so that it doesn't have to re-initialize
> requests across requeues, which means that an unprepped request getting
> ahead of requeued ones may lead to deadlock due to resource
> starvation; that's why requeue uses front queueing.
>
> The code changed over time and the above requirement might not be
> necessary at this point. I don't know. However, the block layer doesn't
> have any way to enforce that requests can't hold extra resources
> on requeue, and having such a difficult-to-trigger deadlock condition
> lying dormant is scary.
>
> What kind of benchmarking are we talking about on which kernel?
> blk-flush support has been revamped twice recently. 2.6.38 stripped
> out the block layer barrier thing and then it got re-reimplemented for
> 2.6.39 to support advanced flush merging. If the regression (for
> which benchmark btw?) was visible on the older reimplementation, I'd
> really like to know how it behaves on 2.6.39-rcX.
>
> If the problem is localized to 2.6.38, oh well, too bad, but I don't
> think we care too much. If some distro is basing their kernel on
> 2.6.38 and the flush regression is hurting them, backporting the new
> implementation from 2.6.39 shouldn't be too difficult after all. The
> reimplementation was almost self-contained.
>
> If the regression affects the 2.6.39 implementation too, eh well, we need
> to think of something, but I'd really like to know what kind of
> workload we're talking about.
>
> > I'll look at this. Optimizing this should fix the regression too. On the
> > other hand, adding the flush request at the head when it immediately
> > follows a flush still has an advantage, because the drive cache has
> > already been flushed.
>
> The new implementation wouldn't issue two flushes back to back like that;
> it doesn't make any sense to begin with. Again, what have you been
> testing and how?
Hi Tejun,
this is a regression in 2.6.39-rc2 compared to 2.6.39-rc1, so it
isn't related to the flush rewrite. The workload is sysbench fileio;
please see the first mail in the thread for details.
Thanks,
Shaohua
Hello, Shaohua.
On Mon, Apr 25, 2011 at 09:01:59AM +0800, Shaohua Li wrote:
> this is a regression in 2.6.39-rc2 compared to 2.6.39-rc1, so it
> isn't related to the flush rewrite. The workload is sysbench fileio;
> please see the first mail in the thread for details.
Understood. Let's talk on the other thread.
Thanks.
--
tejun
On Mon, 2011-04-25 at 16:21 +0800, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Mon, Apr 25, 2011 at 09:01:59AM +0800, Shaohua Li wrote:
> > this is a regression in 2.6.39-rc2 compared to 2.6.39-rc1, so it
> > isn't related to the flush rewrite. The workload is sysbench fileio;
> > please see the first mail in the thread for details.
>
> Understood. Let's talk on the other thread.
This issue isn't related to the optimization patch in the other thread,
and that patch can't recover the regression (though it does improve
throughput even without the regression). So please look at this issue again.
Thanks,
Shaohua
Hey,
On Tue, Apr 26, 2011 at 08:49:15AM +0800, Shaohua Li wrote:
> On Mon, 2011-04-25 at 16:21 +0800, Tejun Heo wrote:
> > Hello, Shaohua.
> >
> > On Mon, Apr 25, 2011 at 09:01:59AM +0800, Shaohua Li wrote:
> > > this is a regression in 2.6.39-rc2 compared to 2.6.39-rc1, so it
> > > isn't related to the flush rewrite. The workload is sysbench fileio;
> > > please see the first mail in the thread for details.
> >
> > Understood. Let's talk on the other thread.
>
> This issue isn't related to the optimization patch in the other thread,
> and that patch can't recover the regression (though it does improve
> throughput even without the regression). So please look at this issue again.
IIUC, the regression happened because, before, back-to-back flushes
were basically optimized out by the hardware but, after, with regular
writes thrown into the mix, they aren't. If that's the case, I would still
prefer to solve this on the issue side instead of the completion side if
possible (it might not be possible, though).
Or is the latency introduced for each flush actually making a difference
for this specific benchmark? Hmmm... maybe that's the case, given that
your patch merging back-to-back flushes doesn't recover the whole
regression.
I don't know. Darrick, can you please chime in? Do you see a
regression between front and back queueing of flushes? The original
thread is
http://thread.gmane.org/gmane.linux.kernel/1127779
and the offending commit is 53d63e6b0dfb95882ec0219ba6bbd50cde423794
(block: make the flush insertion use the tail of the dispatch list).
Thanks.
--
tejun
On Tue, 2011-04-26 at 19:29 +0800, Tejun Heo wrote:
Hi Tejun,
> On Tue, Apr 26, 2011 at 08:49:15AM +0800, Shaohua Li wrote:
> > On Mon, 2011-04-25 at 16:21 +0800, Tejun Heo wrote:
> > > Hello, Shaohua.
> > >
> > > On Mon, Apr 25, 2011 at 09:01:59AM +0800, Shaohua Li wrote:
> > > > this is a regression in 2.6.39-rc2 compared to 2.6.39-rc1, so it
> > > > isn't related to the flush rewrite. The workload is sysbench fileio;
> > > > please see the first mail in the thread for details.
> > >
> > > Understood. Let's talk on the other thread.
> >
> > This issue isn't related to the optimization patch in the other thread,
> > and that patch can't recover the regression (though it does improve
> > throughput even without the regression). So please look at this issue again.
>
> IIUC, the regression happened because, before, back-to-back flushes
> were basically optimized out by the hardware but, after, with regular
> writes thrown into the mix, they aren't. If that's the case, I would still
> prefer to solve this on the issue side instead of the completion side if
> possible (it might not be possible, though).
>
> Or is the latency introduced for each flush actually making a difference
> for this specific benchmark? Hmmm... maybe that's the case, given that
> your patch merging back-to-back flushes doesn't recover the whole
> regression.
The reason is quite subtle, and the queue has just two requests most of the
time. Say we have f1, w1, f2, where f1 is running and f2 is requeued.
Without the regression, we run f1, f2, w1.
With the regression, we run f1, w1, f2.
Without the regression, f2 finishes quite quickly and it appears we run
more writes before running a flush.
If you want to check the problem, you should use SATA, which gives a lot
of requeues. A high-end drive might not be good for investigating the problem.
Thanks,
Shaohua