On certain types of storage hardware, flushing the write cache takes a
considerable amount of time. Typically, these are simple storage systems with
write cache enabled and no battery to save that cache during a power failure.
When we encounter a system with many I/O threads that try to flush the cache,
performance is suboptimal because each of those threads issues its own flush
command to the drive instead of trying to coordinate the flushes, thereby
wasting execution time.
Instead of each thread initiating its own flush, we now try to detect the
situation where multiple threads are issuing flush requests. The first thread
to enter blkdev_issue_flush becomes the owner of the flush, and all threads
that enter blkdev_issue_flush before the flush finishes are queued up to wait
for the next flush. When that first flush finishes, one of those sleeping
threads is woken up to perform the next flush and then wake up the other
threads which are asleep waiting for the second flush to finish.
In the single-threaded case, the thread will simply issue the flush and exit.
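As a purely illustrative example (hypothetical timing), suppose threads T1-T4
call blkdev_issue_flush at nearly the same time:

  T1: arrives first, becomes the owner, and issues flush #1
  T2: arrives while #1 is in flight and becomes the first waiter
  T3: arrives while #1 is in flight and queues behind T2
  T4: arrives while #1 is in flight and queues behind T3

When flush #1 completes, T2 is woken up to issue flush #2; when flush #2
completes, T3 and T4 are woken up and return its result. Four callers are
satisfied with two flush commands instead of four.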
To test the performance of this latest patch, I created a spreadsheet
reflecting the performance numbers I obtained with the same ffsb fsync-happy
workload that I've been running all along: http://tinyurl.com/6xqk5bs
The second tab of the workbook provides easy comparisons of the performance
before and after adding flush coordination to the block layer. Variations in
the runs were never more than about 5%, so the slight performance increases and
decreases are negligible. It is expected that devices with low flush times
should not show much change, whether the low flush times are due to the lack of
write cache or the controller having a battery and thereby ignoring the flush
command.
Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
from flush coordination. These 6 configurations all feature large write caches
without battery backups, and fairly high (or at least non-zero) average flush
times, as was discovered when I studied the v6 patch.
Unfortunately, there is one very odd regression: elm3c44_sas. This profile is
a couple of battery-backed RAID cabinets striped together with raid0 on md. I
suspect that there is some sort of problematic interaction with md, because
running ffsb on the individual hardware arrays produces numbers similar to
elm3c71_extsas. elm3c71_extsas uses the same type of hardware array as does
elm3c44_sas, in fact.
FYI, the flush coordination patch shows performance improvements both with and
without Christoph's patch that issues pure flushes directly. The spreadsheet
only captures the performance numbers collected without Christoph's patch.
Signed-off-by: Darrick J. Wong <[email protected]>
---
block/blk-flush.c | 137 +++++++++++++++++++++++++++++++++++++++++--------
block/genhd.c | 12 ++++
include/linux/genhd.h | 15 +++++
3 files changed, 143 insertions(+), 21 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 2402a34..d6c9931 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -201,6 +201,60 @@ static void bio_end_flush(struct bio *bio, int err)
bio_put(bio);
}
+static int blkdev_issue_flush_now(struct block_device *bdev,
+ gfp_t gfp_mask, sector_t *error_sector)
+{
+ DECLARE_COMPLETION_ONSTACK(wait);
+ struct bio *bio;
+ int ret = 0;
+
+ bio = bio_alloc(gfp_mask, 0);
+ bio->bi_end_io = bio_end_flush;
+ bio->bi_bdev = bdev;
+ bio->bi_private = &wait;
+
+ bio_get(bio);
+ submit_bio(WRITE_FLUSH, bio);
+ wait_for_completion(&wait);
+
+ /*
+ * The driver must store the error location in ->bi_sector, if
+ * it supports it. For non-stacked drivers, this should be
+ * copied from blk_rq_pos(rq).
+ */
+ if (error_sector)
+ *error_sector = bio->bi_sector;
+
+ if (!bio_flagged(bio, BIO_UPTODATE))
+ ret = -EIO;
+
+ bio_put(bio);
+ return ret;
+}
+
+struct flush_completion_t *alloc_flush_completion(gfp_t gfp_mask)
+{
+ struct flush_completion_t *t;
+
+ t = kzalloc(sizeof(*t), gfp_mask);
+ if (!t)
+ return t;
+
+ init_completion(&t->ready);
+ init_completion(&t->finish);
+ kref_init(&t->ref);
+
+ return t;
+}
+
+void free_flush_completion(struct kref *ref)
+{
+ struct flush_completion_t *f;
+
+ f = container_of(ref, struct flush_completion_t, ref);
+ kfree(f);
+}
+
/**
* blkdev_issue_flush - queue a flush
* @bdev: blockdev to issue flush for
@@ -216,18 +270,22 @@ static void bio_end_flush(struct bio *bio, int err)
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
sector_t *error_sector)
{
- DECLARE_COMPLETION_ONSTACK(wait);
+ struct flush_completion_t *flush, *new_flush;
struct request_queue *q;
- struct bio *bio;
+ struct gendisk *disk;
int ret = 0;
- if (bdev->bd_disk == NULL)
+ disk = bdev->bd_disk;
+ if (disk == NULL)
return -ENXIO;
q = bdev_get_queue(bdev);
if (!q)
return -ENXIO;
+ if (!(q->flush_flags & REQ_FLUSH))
+ return -ENXIO;
+
/*
* some block devices may not have their queue correctly set up here
* (e.g. loop device without a backing file) and so issuing a flush
@@ -237,27 +295,64 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
if (!q->make_request_fn)
return -ENXIO;
- bio = bio_alloc(gfp_mask, 0);
- bio->bi_end_io = bio_end_flush;
- bio->bi_bdev = bdev;
- bio->bi_private = &wait;
-
- bio_get(bio);
- submit_bio(WRITE_FLUSH, bio);
- wait_for_completion(&wait);
+ /* coordinate flushes */
+ new_flush = alloc_flush_completion(gfp_mask);
+ if (!new_flush)
+ return -ENOMEM;
+
+again:
+ spin_lock(&disk->flush_flag_lock);
+ if (disk->in_flush) {
+ /* Flush in progress */
+ kref_get(&disk->next_flush->ref);
+ flush = disk->next_flush;
+ ret = atomic_read(&flush->waiters);
+ atomic_inc(&flush->waiters);
+ spin_unlock(&disk->flush_flag_lock);
- /*
- * The driver must store the error location in ->bi_sector, if
- * it supports it. For non-stacked drivers, this should be
- * copied from blk_rq_pos(rq).
- */
- if (error_sector)
- *error_sector = bio->bi_sector;
+ /*
+ * If there aren't any waiters, this thread will be woken
+ * up to start the next flush.
+ */
+ if (!ret) {
+ wait_for_completion(&flush->ready);
+ kref_put(&flush->ref, free_flush_completion);
+ goto again;
+ }
- if (!bio_flagged(bio, BIO_UPTODATE))
- ret = -EIO;
+ /* Otherwise, just wait for this flush to end. */
+ ret = wait_for_completion_killable(&flush->finish);
+ if (!ret)
+ ret = flush->ret;
+ kref_put(&flush->ref, free_flush_completion);
+ kref_put(&new_flush->ref, free_flush_completion);
+ } else {
+ /* no flush in progress */
+ flush = disk->next_flush;
+ disk->next_flush = new_flush;
+ disk->in_flush = 1;
+ spin_unlock(&disk->flush_flag_lock);
+
+ ret = blkdev_issue_flush_now(bdev, gfp_mask, error_sector);
+ flush->ret = ret;
+
+ /* Wake up the thread that starts the next flush. */
+ spin_lock(&disk->flush_flag_lock);
+ disk->in_flush = 0;
+ /*
+ * This line must be between the zeroing of in_flush and the
+ * spin_unlock because we don't want the next-flush thread to
+ * start messing with pointers until we're safely out of this
+ * section. It must be the first complete_all, because on some
+ * fast devices, the next flush finishes (and frees
+ * next_flush!) before the second complete_all finishes!
+ */
+ complete_all(&new_flush->ready);
+ spin_unlock(&disk->flush_flag_lock);
- bio_put(bio);
+ complete_all(&flush->finish);
+ kref_put(&flush->ref, free_flush_completion);
+ }
return ret;
}
EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..d239eda 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1009,6 +1009,7 @@ static void disk_release(struct device *dev)
disk_replace_part_tbl(disk, NULL);
free_part_stats(&disk->part0);
free_part_info(&disk->part0);
+ kref_put(&disk->next_flush->ref, free_flush_completion);
kfree(disk);
}
struct class block_class = {
@@ -1177,17 +1178,24 @@ EXPORT_SYMBOL(alloc_disk);
struct gendisk *alloc_disk_node(int minors, int node_id)
{
struct gendisk *disk;
+ struct flush_completion_t *t;
+
+ t = alloc_flush_completion(GFP_KERNEL);
+ if (!t)
+ return NULL;
disk = kmalloc_node(sizeof(struct gendisk),
GFP_KERNEL | __GFP_ZERO, node_id);
if (disk) {
if (!init_part_stats(&disk->part0)) {
+ kfree(t);
kfree(disk);
return NULL;
}
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
+ kfree(t);
kfree(disk);
return NULL;
}
@@ -1200,6 +1208,10 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
device_initialize(disk_to_dev(disk));
INIT_WORK(&disk->async_notify,
media_change_notify_thread);
+
+ disk->in_flush = 0;
+ spin_lock_init(&disk->flush_flag_lock);
+ disk->next_flush = t;
}
return disk;
}
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7a7b9c1..564a1d1 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -16,6 +16,16 @@
#ifdef CONFIG_BLOCK
+struct flush_completion_t {
+ struct completion ready;
+ struct completion finish;
+ int ret;
+ atomic_t waiters;
+ struct kref ref;
+};
+extern struct flush_completion_t *alloc_flush_completion(gfp_t gfp_mask);
+extern void free_flush_completion(struct kref *ref);
+
#define kobj_to_dev(k) container_of((k), struct device, kobj)
#define dev_to_disk(device) container_of((device), struct gendisk, part0.__dev)
#define dev_to_part(device) container_of((device), struct hd_struct, __dev)
@@ -178,6 +188,11 @@ struct gendisk {
struct blk_integrity *integrity;
#endif
int node_id;
+
+ /* flush coordination */
+ spinlock_t flush_flag_lock;
+ int in_flush;
+ struct flush_completion_t *next_flush;
};
static inline struct gendisk *part_to_disk(struct hd_struct *part)
2011/1/13 Darrick J. Wong <[email protected]>:
> On certain types of storage hardware, flushing the write cache takes a
> considerable amount of time. Typically, these are simple storage systems with
> write cache enabled and no battery to save that cache during a power failure.
> When we encounter a system with many I/O threads that try to flush the cache,
> performance is suboptimal because each of those threads issues its own flush
> command to the drive instead of trying to coordinate the flushes, thereby
> wasting execution time.
>
> Instead of each thread initiating its own flush, we now try to detect the
> situation where multiple threads are issuing flush requests. The first thread
> to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> that enter blkdev_issue_flush before the flush finishes are queued up to wait
> for the next flush. When that first flush finishes, one of those sleeping
> threads is woken up to perform the next flush and then wake up the other
> threads which are asleep waiting for the second flush to finish.
>
> In the single-threaded case, the thread will simply issue the flush and exit.
>
> To test the performance of this latest patch, I created a spreadsheet
> reflecting the performance numbers I obtained with the same ffsb fsync-happy
> workload that I've been running all along: http://tinyurl.com/6xqk5bs
>
> The second tab of the workbook provides easy comparisons of the performance
> before and after adding flush coordination to the block layer. Variations in
> the runs were never more than about 5%, so the slight performance increases and
> decreases are negligible. It is expected that devices with low flush times
> should not show much change, whether the low flush times are due to the lack of
> write cache or the controller having a battery and thereby ignoring the flush
> command.
>
> Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
> elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
> from flush coordination. These 6 configurations all feature large write caches
> without battery backups, and fairly high (or at least non-zero) average flush
> times, as was discovered when I studied the v6 patch.
>
> Unfortunately, there is one very odd regression: elm3c44_sas. This profile is
> a couple of battery-backed RAID cabinets striped together with raid0 on md. I
> suspect that there is some sort of problematic interaction with md, because
> running ffsb on the individual hardware arrays produces numbers similar to
> elm3c71_extsas. elm3c71_extsas uses the same type of hardware array as does
> elm3c44_sas, in fact.
>
> FYI, the flush coordination patch shows performance improvements both with and
> without Christoph's patch that issues pure flushes directly. The spreadsheet
> only captures the performance numbers collected without Christoph's patch.
Hi,
can you explain why there is an improvement with your patch? If there are
multiple flushes, blk_do_flush already has a queue for them (the
->pending_flushes list).
Thanks,
Shaohua
On Thu, Jan 13, 2011 at 01:38:55PM +0800, Shaohua Li wrote:
> 2011/1/13 Darrick J. Wong <[email protected]>:
> > On certain types of storage hardware, flushing the write cache takes a
> > considerable amount of time. Typically, these are simple storage systems with
> > write cache enabled and no battery to save that cache during a power failure.
> > When we encounter a system with many I/O threads that try to flush the cache,
> > performance is suboptimal because each of those threads issues its own flush
> > command to the drive instead of trying to coordinate the flushes, thereby
> > wasting execution time.
> >
> > Instead of each thread initiating its own flush, we now try to detect the
> > situation where multiple threads are issuing flush requests. The first thread
> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
> > for the next flush. When that first flush finishes, one of those sleeping
> > threads is woken up to perform the next flush and then wake up the other
> > threads which are asleep waiting for the second flush to finish.
> >
> > In the single-threaded case, the thread will simply issue the flush and exit.
> >
> > To test the performance of this latest patch, I created a spreadsheet
> > reflecting the performance numbers I obtained with the same ffsb fsync-happy
> > workload that I've been running all along: http://tinyurl.com/6xqk5bs
> >
> > The second tab of the workbook provides easy comparisons of the performance
> > before and after adding flush coordination to the block layer. Variations in
> > the runs were never more than about 5%, so the slight performance increases and
> > decreases are negligible. It is expected that devices with low flush times
> > should not show much change, whether the low flush times are due to the lack of
> > write cache or the controller having a battery and thereby ignoring the flush
> > command.
> >
> > Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
> > elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
> > from flush coordination. These 6 configurations all feature large write caches
> > without battery backups, and fairly high (or at least non-zero) average flush
> > times, as was discovered when I studied the v6 patch.
> >
> > Unfortunately, there is one very odd regression: elm3c44_sas. This profile is
> > a couple of battery-backed RAID cabinets striped together with raid0 on md. I
> > suspect that there is some sort of problematic interaction with md, because
> > running ffsb on the individual hardware arrays produces numbers similar to
> > elm3c71_extsas. elm3c71_extsas uses the same type of hardware array as does
> > elm3c44_sas, in fact.
> >
> > FYI, the flush coordination patch shows performance improvements both with and
> > without Christoph's patch that issues pure flushes directly. The spreadsheet
> > only captures the performance numbers collected without Christoph's patch.
> Hi,
> can you explain why there is an improvement with your patch? If there are
> multiple flushes, blk_do_flush already has a queue for them (the
> ->pending_flushes list).
With the current code, if we have n threads trying to issue flushes, the block
layer will issue n flushes one after the other. I think the point of
Christoph's pure-flush patch is to skip the serialization step and allow
issuing of the n pure flushes in parallel. The point of this patch is to optimize
even more aggressively, such that as soon as the system becomes free to process
a pure flush (at time t), all the requests for a pure flush that were created
since the last time a pure flush was actually issued can be covered with a
single flush issued at time t. In other words, this patch tries to reduce the
number of pure flushes issued.
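As a hypothetical example of the difference: if one pure flush is in flight and
ten more threads call blkdev_issue_flush before it completes, the current code
ends up issuing roughly ten more flushes back to back, while with coordination a
single flush issued once the first one completes satisfies all ten waiters, so
eleven callers are covered by two flush commands.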
--D
Hello, Darrick.
On Wed, Jan 12, 2011 at 06:56:46PM -0800, Darrick J. Wong wrote:
> On certain types of storage hardware, flushing the write cache takes a
> considerable amount of time. Typically, these are simple storage systems with
> write cache enabled and no battery to save that cache during a power failure.
> When we encounter a system with many I/O threads that try to flush the cache,
> performance is suboptimal because each of those threads issues its own flush
> command to the drive instead of trying to coordinate the flushes, thereby
> wasting execution time.
>
> Instead of each thread initiating its own flush, we now try to detect the
> situation where multiple threads are issuing flush requests. The first thread
> to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> that enter blkdev_issue_flush before the flush finishes are queued up to wait
> for the next flush. When that first flush finishes, one of those sleeping
> threads is woken up to perform the next flush and then wake up the other
> threads which are asleep waiting for the second flush to finish.
Nice work. :-)
> block/blk-flush.c | 137 +++++++++++++++++++++++++++++++++++++++++--------
> block/genhd.c | 12 ++++
> include/linux/genhd.h | 15 +++++
> 3 files changed, 143 insertions(+), 21 deletions(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 2402a34..d6c9931 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -201,6 +201,60 @@ static void bio_end_flush(struct bio *bio, int err)
> bio_put(bio);
> }
>
> +static int blkdev_issue_flush_now(struct block_device *bdev,
> + gfp_t gfp_mask, sector_t *error_sector)
> +{
> + DECLARE_COMPLETION_ONSTACK(wait);
> + struct bio *bio;
> + int ret = 0;
> +
> + bio = bio_alloc(gfp_mask, 0);
> + bio->bi_end_io = bio_end_flush;
> + bio->bi_bdev = bdev;
> + bio->bi_private = &wait;
> +
> + bio_get(bio);
> + submit_bio(WRITE_FLUSH, bio);
> + wait_for_completion(&wait);
> +
> + /*
> + * The driver must store the error location in ->bi_sector, if
> + * it supports it. For non-stacked drivers, this should be
> + * copied from blk_rq_pos(rq).
> + */
> + if (error_sector)
> + *error_sector = bio->bi_sector;
> +
> + if (!bio_flagged(bio, BIO_UPTODATE))
> + ret = -EIO;
> +
> + bio_put(bio);
> + return ret;
> +}
But wouldn't it be better to implement this directly in the flush
machinery instead of as a blkdev_issue_flush() wrapper? We have all the
information at the request queue level, so we can easily detect whether
flushes can be merged or not, and whether something is issued by
blkdev_issue_flush() or by directly submitting a bio wouldn't matter at
all. Am I missing something?
Thanks.
--
tejun
On Thu, Jan 13, 2011 at 11:42:40AM +0100, Tejun Heo wrote:
> Hello, Darrick.
>
> On Wed, Jan 12, 2011 at 06:56:46PM -0800, Darrick J. Wong wrote:
> > On certain types of storage hardware, flushing the write cache takes a
> > considerable amount of time. Typically, these are simple storage systems with
> > write cache enabled and no battery to save that cache during a power failure.
> > When we encounter a system with many I/O threads that try to flush the cache,
> > performance is suboptimal because each of those threads issues its own flush
> > command to the drive instead of trying to coordinate the flushes, thereby
> > wasting execution time.
> >
> > Instead of each thread initiating its own flush, we now try to detect the
> > situation where multiple threads are issuing flush requests. The first thread
> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
> > for the next flush. When that first flush finishes, one of those sleeping
> > threads is woken up to perform the next flush and then wake up the other
> > threads which are asleep waiting for the second flush to finish.
>
> Nice work. :-)
Thank you!
> > block/blk-flush.c | 137 +++++++++++++++++++++++++++++++++++++++++--------
> > block/genhd.c | 12 ++++
> > include/linux/genhd.h | 15 +++++
> > 3 files changed, 143 insertions(+), 21 deletions(-)
> >
> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index 2402a34..d6c9931 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -201,6 +201,60 @@ static void bio_end_flush(struct bio *bio, int err)
> > bio_put(bio);
> > }
> >
> > +static int blkdev_issue_flush_now(struct block_device *bdev,
> > + gfp_t gfp_mask, sector_t *error_sector)
> > +{
> > + DECLARE_COMPLETION_ONSTACK(wait);
> > + struct bio *bio;
> > + int ret = 0;
> > +
> > + bio = bio_alloc(gfp_mask, 0);
> > + bio->bi_end_io = bio_end_flush;
> > + bio->bi_bdev = bdev;
> > + bio->bi_private = &wait;
> > +
> > + bio_get(bio);
> > + submit_bio(WRITE_FLUSH, bio);
> > + wait_for_completion(&wait);
> > +
> > + /*
> > + * The driver must store the error location in ->bi_sector, if
> > + * it supports it. For non-stacked drivers, this should be
> > + * copied from blk_rq_pos(rq).
> > + */
> > + if (error_sector)
> > + *error_sector = bio->bi_sector;
> > +
> > + if (!bio_flagged(bio, BIO_UPTODATE))
> > + ret = -EIO;
> > +
> > + bio_put(bio);
> > + return ret;
> > +}
>
> But wouldn't it be better to implement this directly in the flush
> machinery instead of as a blkdev_issue_flush() wrapper? We have all the
> information at the request queue level, so we can easily detect whether
> flushes can be merged or not, and whether something is issued by
> blkdev_issue_flush() or by directly submitting a bio wouldn't matter at
> all. Am I missing something?
Could you clarify where exactly you meant by "in the flush machinery"? I think
what you're suggesting is that blk_flush_complete_seq could scan the pending
flush list to find a run of consecutive pure flush requests, submit the last
one, and set things up so that during the completion of the flush, all the
requests in that run are returned with the real flush's return code.
If that's what you meant, then yes, it could be done that way. However, I have
a few reasons for choosing the blkdev_issue_flush approach. First, as far as I
could tell in the kernel, all the code paths that involve upper layers issuing
pure flushes go through blkdev_issue_flush, so it seems like a convenient place
to capture all the incoming pure flushes. Other parts of the kernel issue
writes with the flush flag set, but we want those to go through the regular
queuing mechanisms anyway. Second, with the proposed patch, any upper-layer
code that has a requirement for a pure flush to be issued without coordination
can do so by submitting the bio directly (or blkdev_issue_flush_now can be
exported). Third, baking the coordination into the flush machinery makes that
machinery more complicated to understand and maintain.
So in short, I went with the blkdev_issue_flush approach because the code is
easier to understand, and it's not losing any pure flushes despite casting a
narrower net.
I was also wondering, how many others are testing these flush-pain-reduction
patches? It would be nice to know about wider testing than just my lab. :)
--D
2011/1/13 Darrick J. Wong <[email protected]>:
> On Thu, Jan 13, 2011 at 01:38:55PM +0800, Shaohua Li wrote:
>> 2011/1/13 Darrick J. Wong <[email protected]>:
>> > On certain types of storage hardware, flushing the write cache takes a
>> > considerable amount of time. Typically, these are simple storage systems with
>> > write cache enabled and no battery to save that cache during a power failure.
>> > When we encounter a system with many I/O threads that try to flush the cache,
>> > performance is suboptimal because each of those threads issues its own flush
>> > command to the drive instead of trying to coordinate the flushes, thereby
>> > wasting execution time.
>> >
>> > Instead of each thread initiating its own flush, we now try to detect the
>> > situation where multiple threads are issuing flush requests. The first thread
>> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
>> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
>> > for the next flush. When that first flush finishes, one of those sleeping
>> > threads is woken up to perform the next flush and then wake up the other
>> > threads which are asleep waiting for the second flush to finish.
>> >
>> > In the single-threaded case, the thread will simply issue the flush and exit.
>> >
>> > To test the performance of this latest patch, I created a spreadsheet
>> > reflecting the performance numbers I obtained with the same ffsb fsync-happy
>> > workload that I've been running all along: http://tinyurl.com/6xqk5bs
>> >
>> > The second tab of the workbook provides easy comparisons of the performance
>> > before and after adding flush coordination to the block layer. Variations in
>> > the runs were never more than about 5%, so the slight performance increases and
>> > decreases are negligible. It is expected that devices with low flush times
>> > should not show much change, whether the low flush times are due to the lack of
>> > write cache or the controller having a battery and thereby ignoring the flush
>> > command.
>> >
>> > Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
>> > elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
>> > from flush coordination. These 6 configurations all feature large write caches
>> > without battery backups, and fairly high (or at least non-zero) average flush
>> > times, as was discovered when I studied the v6 patch.
>> >
>> > Unfortunately, there is one very odd regression: elm3c44_sas. This profile is
>> > a couple of battery-backed RAID cabinets striped together with raid0 on md. I
>> > suspect that there is some sort of problematic interaction with md, because
>> > running ffsb on the individual hardware arrays produces numbers similar to
>> > elm3c71_extsas. elm3c71_extsas uses the same type of hardware array as does
>> > elm3c44_sas, in fact.
>> >
>> > FYI, the flush coordination patch shows performance improvements both with and
>> > without Christoph's patch that issues pure flushes directly. The spreadsheet
>> > only captures the performance numbers collected without Christoph's patch.
>> Hi,
>> can you explain why there is an improvement with your patch? If there are
>> multiple flushes, blk_do_flush already has a queue for them (the
>> ->pending_flushes list).
>
> With the current code, if we have n threads trying to issue flushes, the block
> layer will issue n flushes one after the other. I think the point of
> Christoph's pure-flush patch is to skip the serialization step and allow
> issuing of the n pure flushes in parallel. The point of this patch is to optimize
> even more aggressively, such that as soon as the system becomes free to process
> a pure flush (at time t), all the requests for a pure flush that were created
> since the last time a pure flush was actually issued can be covered with a
> single flush issued at time t. In other words, this patch tries to reduce the
> number of pure flushes issued.
Thanks, but why do the merging only for pure flushes? We can merge normal
flush requests with a pure flush request too.
+ complete_all(&new_flush->ready);
+ spin_unlock(&disk->flush_flag_lock);
- bio_put(bio);
+ complete_all(&flush->finish);
This doesn't seem to guarantee that the second waiter runs after the first
waiter. Am I missing anything?
It appears we can easily implement this in blk_do_flush. I have
something at hand too; it passed testing, but I have no data yet.
On Fri, Jan 14, 2011 at 09:00:22AM +0800, Shaohua Li wrote:
> 2011/1/13 Darrick J. Wong <[email protected]>:
> > On Thu, Jan 13, 2011 at 01:38:55PM +0800, Shaohua Li wrote:
> >> 2011/1/13 Darrick J. Wong <[email protected]>:
> >> > On certain types of storage hardware, flushing the write cache takes a
> >> > considerable amount of time. Typically, these are simple storage systems with
> >> > write cache enabled and no battery to save that cache during a power failure.
> >> > When we encounter a system with many I/O threads that try to flush the cache,
> >> > performance is suboptimal because each of those threads issues its own flush
> >> > command to the drive instead of trying to coordinate the flushes, thereby
> >> > wasting execution time.
> >> >
> >> > Instead of each thread initiating its own flush, we now try to detect the
> >> > situation where multiple threads are issuing flush requests. The first thread
> >> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> >> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
> >> > for the next flush. When that first flush finishes, one of those sleeping
> >> > threads is woken up to perform the next flush and then wake up the other
> >> > threads which are asleep waiting for the second flush to finish.
> >> >
> >> > In the single-threaded case, the thread will simply issue the flush and exit.
> >> >
> >> > To test the performance of this latest patch, I created a spreadsheet
> >> > reflecting the performance numbers I obtained with the same ffsb fsync-happy
> >> > workload that I've been running all along: http://tinyurl.com/6xqk5bs
> >> >
> >> > The second tab of the workbook provides easy comparisons of the performance
> >> > before and after adding flush coordination to the block layer. Variations in
> >> > the runs were never more than about 5%, so the slight performance increases and
> >> > decreases are negligible. It is expected that devices with low flush times
> >> > should not show much change, whether the low flush times are due to the lack of
> >> > write cache or the controller having a battery and thereby ignoring the flush
> >> > command.
> >> >
> >> > Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
> >> > elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
> >> > from flush coordination. These 6 configurations all feature large write caches
> >> > without battery backups, and fairly high (or at least non-zero) average flush
> >> > times, as was discovered when I studied the v6 patch.
> >> >
> >> > Unfortunately, there is one very odd regression: elm3c44_sas. This profile is
> >> > a couple of battery-backed RAID cabinets striped together with raid0 on md. I
> >> > suspect that there is some sort of problematic interaction with md, because
> >> > running ffsb on the individual hardware arrays produces numbers similar to
> >> > elm3c71_extsas. elm3c71_extsas uses the same type of hardware array as does
> >> > elm3c44_sas, in fact.
> >> >
> >> > FYI, the flush coordination patch shows performance improvements both with and
> >> > without Christoph's patch that issues pure flushes directly. The spreadsheet
> >> > only captures the performance numbers collected without Christoph's patch.
> >> Hi,
> >> can you explain why there is an improvement with your patch? If there are
> >> multiple flushes, blk_do_flush already has a queue for them (the
> >> ->pending_flushes list).
> >
> > With the current code, if we have n threads trying to issue flushes, the block
> > layer will issue n flushes one after the other. I think the point of
> > Christoph's pure-flush patch is to skip the serialization step and allow
> > issuing of the n pure flushes in parallel. The point of this patch is to optimize
> > even more aggressively, such that as soon as the system becomes free to process
> > a pure flush (at time t), all the requests for a pure flush that were created
> > since the last time a pure flush was actually issued can be covered with a
> > single flush issued at time t. In other words, this patch tries to reduce the
> > number of pure flushes issued.
> Thanks, but why do the merging only for pure flushes? We can merge normal
> flush requests with a pure flush request too.
>
> + complete_all(&new_flush->ready);
> + spin_unlock(&disk->flush_flag_lock);
>
> - bio_put(bio);
> + complete_all(&flush->finish);
> This doesn't seem to guarantee that the second waiter runs after the first
> waiter. Am I missing anything?
The first complete_all wakes up the thread that starts the next flush. The
second complete_all wakes up the threads that were waiting to hear the results
of the flush issued by the current thread. The second set of wakeups can
happen in parallel with that first wakeup since they're tied to different flush
contexts.
Maybe a simpler way to express that is that the first waiter is the thread that is
woken up by the first complete_all. That first waiter then executes the flush
and then executes the second complete_all, thereby waking up the second waiter.
That's how the second waiter only runs after the first waiter has finished the
flush. After the point where either waiter can be running, the only tasks left
to do are to collect the flush return code and to clean up, and that part can
happen in any order.
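To spell out the handoff with hypothetical labels (O = current flush owner,
N = first waiter, W = any later waiter):

  1. O issues its flush, stores the result, clears in_flush, and calls
     complete_all() on the ready completion of the context that N and W share,
     which wakes only N.
  2. O then calls complete_all() on the finish completion of its own context,
     which wakes the threads, if any, that were waiting for the result of O's
     flush.
  3. N, now the owner, issues the next flush, stores its result, and calls
     complete_all() on the finish completion it shares with W, waking W.
  4. W simply reads the stored return code and drops its references.

So W can only observe a return code after N has actually issued and completed
the next flush.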
> It appears we can easily implement this in blk_do_flush. I have
> something at hand too; it passed testing, but I have no data yet.
> Task1: ...FS.....C
> Task2: ....F......S...C
> Task3: ......F.........S..C
> F means a flush is queued, S means a flush is dispatched, and C means the flush
> is completed. In the above, when Task2's flush is completed, we can actually
> complete Task3's flush as well, since Task2's flush was dispatched after Task3's
> flush was queued. For the same reason, we can't merge Task2's and Task3's
> flushes with Task1's.
Conceptually this seems to make sense. I gave your patch a spin with
fsync-happy ffsb. It ran ok for about 15 minutes but then the machine locked
hard when I ran mkfs on a different device. I'll try to reproduce it and
capture the machine state, and when I do I'll report back.
--D
> ---
> block/blk-core.c | 1 +
> block/blk-flush.c | 24 ++++++++++++++++++++++++
> include/linux/blkdev.h | 2 ++
> 3 files changed, 27 insertions(+)
>
> Index: linux/block/blk-core.c
> ===================================================================
> --- linux.orig/block/blk-core.c 2011-01-13 21:02:24.000000000 +0800
> +++ linux/block/blk-core.c 2011-01-13 21:43:03.000000000 +0800
> @@ -128,6 +128,7 @@ void blk_rq_init(struct request_queue *q
> rq->ref_count = 1;
> rq->start_time = jiffies;
> set_start_time_ns(rq);
> + rq->next_batched_flush = NULL;
> }
> EXPORT_SYMBOL(blk_rq_init);
>
> Index: linux/block/blk-flush.c
> ===================================================================
> --- linux.orig/block/blk-flush.c 2011-01-13 21:02:24.000000000 +0800
> +++ linux/block/blk-flush.c 2011-01-14 08:22:56.000000000 +0800
> @@ -42,8 +42,18 @@ static struct request *blk_flush_complet
> /* not complete yet, queue the next flush sequence */
> next_rq = queue_next_fseq(q);
> } else {
> + struct request *tmp_rq = q->orig_flush_rq->next_batched_flush;
> + struct request *tmp_rq2;
> + sector_t error_sector = q->orig_flush_rq->bio->bi_sector;
> +
> /* complete this flush request */
> __blk_end_request_all(q->orig_flush_rq, q->flush_err);
> + while (tmp_rq) {
> + tmp_rq2 = tmp_rq->next_batched_flush;
> + tmp_rq->bio->bi_sector = error_sector;
> + __blk_end_request_all(tmp_rq, q->flush_err);
> + tmp_rq = tmp_rq2;
> + }
> q->orig_flush_rq = NULL;
> q->flush_seq = 0;
>
> @@ -164,6 +174,20 @@ struct request *blk_do_flush(struct requ
> * processing.
> */
> if (q->flush_seq) {
> + struct request *tmp_rq;
> +
> + if ((rq->cmd_flags & REQ_FUA) || blk_rq_sectors(rq))
> + goto add_list;
> + list_for_each_entry(tmp_rq, &q->pending_flushes, queuelist) {
> + if (tmp_rq->cmd_flags & REQ_FLUSH) {
> + rq->next_batched_flush =
> + tmp_rq->next_batched_flush;
> + tmp_rq->next_batched_flush = rq;
> + list_del_init(&rq->queuelist);
> + return NULL;
> + }
> + }
> +add_list:
> list_move_tail(&rq->queuelist, &q->pending_flushes);
> return NULL;
> }
> Index: linux/include/linux/blkdev.h
> ===================================================================
> --- linux.orig/include/linux/blkdev.h 2011-01-13 21:02:24.000000000 +0800
> +++ linux/include/linux/blkdev.h 2011-01-13 21:43:03.000000000 +0800
> @@ -163,6 +163,8 @@ struct request {
>
> /* for bidi */
> struct request *next_rq;
> +
> + struct request *next_batched_flush;
> };
>
> static inline unsigned short req_get_ioprio(struct request *req)
Hello,
On Thu, Jan 13, 2011 at 10:59:46AM -0800, Darrick J. Wong wrote:
> > But wouldn't it be better to implement this directly in the flush
> > machinery instead of as a blkdev_issue_flush() wrapper? We have all the
> > information at the request queue level, so we can easily detect whether
> > flushes can be merged or not, and whether something is issued by
> > blkdev_issue_flush() or by directly submitting a bio wouldn't matter at
> > all. Am I missing something?
>
> Could you clarify where exactly you meant by "in the flush
> machinery"? I think what you're suggesting is that
> blk_flush_complete_seq could scan the pending flush list to find a
> run of consecutive pure flush requests, submit the last one, and set
> things up so that during the completion of the flush, all the
> requests in that run are returned with the real flush's return code.
Yeah, something along that line.
> If that's what you meant, then yes, it could be done that way.
> However, I have a few reasons for choosing the blkdev_issue_flush
> approach. First, as far as I could tell in the kernel, all the code
> paths that involve upper layers issuing pure flushes go through
> blkdev_issue_flush, so it seems like a convenient place to capture
> all the incoming pure flushes.
That means it _can_ be done there but doesn't mean it's the best spot.
> Other parts of the kernel issue writes with the flush flag set, but
> we want those to go through the regular queuing mechanisms anyway.
> Second, with the proposed patch, any upper-layer code that has a
> requirement for a pure flush to be issued without coordination can
> do so by submitting the bio directly (or blkdev_issue_flush_now can
> be exported).
I don't think anyone would need such a capability, but even if someone
does, that should be implemented as a separate explicit
DONT_MERGE_THIS_FLUSH flag, not by exploiting subtle inconsistencies in
the wrapper interface.
> Third, baking the coordination into the flush machinery makes that
> machinery more complicated to understand and maintain.
And the third point is completely nuts. That's like saying putting
related things together makes the concentrated code more complex, so
let's scatter things all over the place. No, it's not making the code
more complex. It's putting the complexity where it belongs. It
decreases maintenance overhead by increasing consistency. Not only
that, it even results in visibly more consistent and sane _behavior_,
and the said complex machinery is less than 300 lines long.
> So in short, I went with the blkdev_issue_flush approach because the
> code is easier to understand, and it's not losing any pure flushes
> despite casting a narrower net.
No, not at all. You're adding unobvious logic into a wrapper
interface, creating inconsistent behavior and artificial distinctions
between pure and non-pure flushes, and between flushes issued by bio
and by the wrapper interface. Come on. Think about it again. You
can't seriously be standing by the above rationales.
Thanks.
--
tejun
Hello,
On Fri, Jan 14, 2011 at 09:00:22AM +0800, Shaohua Li wrote:
> It appears we can easily implement this in blk_do_flush. I have
> something at hand too; it passed testing, but I have no data yet.
> Task1: ...FS.....C
> Task2: ....F......S...C
> Task3: ......F.........S..C
> F means a flush is queued, S means a flush is dispatched, and C means the flush
> is completed. In the above, when Task2's flush is completed, we can actually
> complete Task3's flush as well, since Task2's flush was dispatched after Task3's
> flush was queued. For the same reason, we can't merge Task2's and Task3's
> flushes with Task1's.
I think this is the correct direction but we can take it further. The
only thing the block layer has to guarantee is (ignoring FUA support):
* If the flush request is empty, at least one flush is executed upon
its completion.
* If the flush request is not empty, the payload is written only after
a preceding flush and followed by another flush if requested.
So, if we can combine N flushes with or without data,
PREFLUSH -> N flush data payloads -> POSTFLUSH
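For example (hypothetical mix): if two pure flushes and one write marked
REQ_FLUSH|REQ_FUA are pending on a device without FUA when a sequence starts,
the whole batch can be served as one PREFLUSH, the single data payload, and one
POSTFLUSH, with all three requests completed when the POSTFLUSH finishes.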
The following is something I hacked up over the past few hours. I just made it
compile, so it's definitely horridly broken. Please don't try to even
test it, but it should still give an idea about the direction. I'll
try to finish this early next week.
Thank you.
Index: work/block/cfq-iosched.c
===================================================================
--- work.orig/block/cfq-iosched.c
+++ work/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8)
#define RQ_CIC(rq) \
- ((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private2)
-#define RQ_CFQG(rq) (struct cfq_group *) ((rq)->elevator_private3)
+ ((struct cfq_io_context *) (rq)->elevator_private[0])
+#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private[1])
+#define RQ_CFQG(rq) (struct cfq_group *) ((rq)->elevator_private[2])
static struct kmem_cache *cfq_pool;
static struct kmem_cache *cfq_ioc_pool;
@@ -3609,12 +3609,12 @@ static void cfq_put_request(struct reque
put_io_context(RQ_CIC(rq)->ioc);
- rq->elevator_private = NULL;
- rq->elevator_private2 = NULL;
+ rq->elevator_private[0] = NULL;
+ rq->elevator_private[1] = NULL;
/* Put down rq reference on cfqg */
cfq_put_cfqg(RQ_CFQG(rq));
- rq->elevator_private3 = NULL;
+ rq->elevator_private[2] = NULL;
cfq_put_queue(cfqq);
}
@@ -3702,9 +3702,9 @@ new_queue:
cfqq->allocated[rw]++;
cfqq->ref++;
- rq->elevator_private = cic;
- rq->elevator_private2 = cfqq;
- rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+ rq->elevator_private[0] = cic;
+ rq->elevator_private[1] = cfqq;
+ rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
spin_unlock_irqrestore(q->queue_lock, flags);
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q,
q->elevator->ops->elevator_add_req_fn(q, rq);
break;
+ case ELEVATOR_INSERT_FLUSH:
+ rq->cmd_flags |= REQ_SOFTBARRIER;
+ blk_insert_flush(rq);
+ break;
+
default:
printk(KERN_ERR "%s: bad insertion point %d\n",
__func__, where);
@@ -725,12 +730,15 @@ int elv_queue_empty(struct request_queue
struct elevator_queue *e = q->elevator;
if (!list_empty(&q->queue_head))
- return 0;
+ return false;
+
+ if (!q->flush_seq && !list_empty(&q->pending_flushes))
+ return false;
if (e->ops->elevator_queue_empty_fn)
return e->ops->elevator_queue_empty_fn(q);
- return 1;
+ return true;
}
EXPORT_SYMBOL(elv_queue_empty);
@@ -759,7 +767,7 @@ int elv_set_request(struct request_queue
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
- rq->elevator_private = NULL;
+ memset(rq->elevator_private, 0, sizeof(rq->elevator_private));
return 0;
}
@@ -781,21 +789,25 @@ int elv_may_queue(struct request_queue *
return ELV_MQUEUE_MAY;
}
+static void elv_abort_request(struct request_queue *q, struct request *rq)
+{
+ rq->cmd_flags |= REQ_QUIET;
+ trace_block_rq_abort(q, rq);
+ /*
+ * Mark this request as started so we don't trigger any debug logic
+ * in the end I/O path.
+ */
+ blk_start_request(rq);
+ __blk_end_request_all(rq, -EIO);
+}
+
void elv_abort_queue(struct request_queue *q)
{
- struct request *rq;
+ while (!list_empty(&q->queue_head))
+ elv_abort_request(q, list_entry_rq(q->queue_head.next));
- while (!list_empty(&q->queue_head)) {
- rq = list_entry_rq(q->queue_head.next);
- rq->cmd_flags |= REQ_QUIET;
- trace_block_rq_abort(q, rq);
- /*
- * Mark this request as started so we don't trigger
- * any debug logic in the end I/O path.
- */
- blk_start_request(rq);
- __blk_end_request_all(rq, -EIO);
- }
+ while (!list_empty(&q->pending_flushes))
+ elv_abort_request(q, list_entry_rq(q->queue_head.next));
}
EXPORT_SYMBOL(elv_abort_queue);
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -110,9 +110,10 @@ struct request {
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it.
*/
- void *elevator_private;
- void *elevator_private2;
- void *elevator_private3;
+ union {
+ void *elevator_private[3];
+ struct list_head flush_list;
+ };
struct gendisk *rq_disk;
struct hd_struct *part;
@@ -364,10 +365,11 @@ struct request_queue
*/
unsigned int flush_flags;
unsigned int flush_seq;
+ int flush_data_cnt;
int flush_err;
struct request flush_rq;
- struct request *orig_flush_rq;
struct list_head pending_flushes;
+ struct list_head running_flushes;
struct mutex sysfs_lock;
Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -151,27 +151,27 @@ static void req_bio_endio(struct request
{
struct request_queue *q = rq->q;
- if (&q->flush_rq != rq) {
- if (error)
- clear_bit(BIO_UPTODATE, &bio->bi_flags);
- else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
- error = -EIO;
-
- if (unlikely(nbytes > bio->bi_size)) {
- printk(KERN_ERR "%s: want %u bytes done, %u left\n",
- __func__, nbytes, bio->bi_size);
- nbytes = bio->bi_size;
- }
+ if (error)
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+ error = -EIO;
+
+ if (unlikely(nbytes > bio->bi_size)) {
+ printk(KERN_ERR "%s: want %u bytes done, %u left\n",
+ __func__, nbytes, bio->bi_size);
+ nbytes = bio->bi_size;
+ }
- if (unlikely(rq->cmd_flags & REQ_QUIET))
- set_bit(BIO_QUIET, &bio->bi_flags);
+ if (unlikely(rq->cmd_flags & REQ_QUIET))
+ set_bit(BIO_QUIET, &bio->bi_flags);
- bio->bi_size -= nbytes;
- bio->bi_sector += (nbytes >> 9);
+ bio->bi_size -= nbytes;
+ bio->bi_sector += (nbytes >> 9);
- if (bio_integrity(bio))
- bio_integrity_advance(bio, nbytes);
+ if (bio_integrity(bio))
+ bio_integrity_advance(bio, nbytes);
+ if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) {
if (bio->bi_size == 0)
bio_endio(bio, error);
} else {
@@ -1219,7 +1219,7 @@ static int __make_request(struct request
spin_lock_irq(q->queue_lock);
if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
- where = ELEVATOR_INSERT_FRONT;
+ where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
}
Index: work/block/blk-flush.c
===================================================================
--- work.orig/block/blk-flush.c
+++ work/block/blk-flush.c
@@ -16,9 +16,12 @@ enum {
QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
QUEUE_FSEQ_DONE = (1 << 4),
+
+ QUEUE_FSEQ_ACTIONS = QUEUE_FSEQ_PREFLUSH | QUEUE_FSEQ_DATA |
+ QUEUE_FSEQ_POSTFLUSH,
};
-static struct request *queue_next_fseq(struct request_queue *q);
+static void queue_next_fseq(struct request_queue *q);
unsigned blk_flush_cur_seq(struct request_queue *q)
{
@@ -27,10 +30,10 @@ unsigned blk_flush_cur_seq(struct reques
return 1 << ffz(q->flush_seq);
}
-static struct request *blk_flush_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+static void blk_flush_complete_seq(struct request_queue *q, unsigned seq,
+ int error)
{
- struct request *next_rq = NULL;
+ struct request *rq, *n;
if (error && !q->flush_err)
q->flush_err = error;
@@ -40,35 +43,28 @@ static struct request *blk_flush_complet
if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
/* not complete yet, queue the next flush sequence */
- next_rq = queue_next_fseq(q);
+ queue_next_fseq(q);
} else {
- /* complete this flush request */
- __blk_end_request_all(q->orig_flush_rq, q->flush_err);
- q->orig_flush_rq = NULL;
+ /* complete all running flush requests */
+ list_for_each_entry_safe(rq, n, &q->running_flushes, flush_list)
+ __blk_end_request_all(rq, q->flush_err);
+ INIT_LIST_HEAD(&q->running_flushes);
q->flush_seq = 0;
-
- /* dispatch the next flush if there's one */
- if (!list_empty(&q->pending_flushes)) {
- next_rq = list_entry_rq(q->pending_flushes.next);
- list_move(&next_rq->queuelist, &q->queue_head);
- }
}
- return next_rq;
}
static void blk_flush_complete_seq_end_io(struct request_queue *q,
unsigned seq, int error)
{
bool was_empty = elv_queue_empty(q);
- struct request *next_rq;
- next_rq = blk_flush_complete_seq(q, seq, error);
+ blk_flush_complete_seq(q, seq, error);
/*
* Moving a request silently to empty queue_head may stall the
* queue. Kick the queue in those cases.
*/
- if (was_empty && next_rq)
+ if (was_empty && !elv_queue_empty(q))
__blk_run_queue(q);
}
@@ -81,7 +77,12 @@ static void pre_flush_end_io(struct requ
static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
+
+ rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+ rq->end_io = NULL;
+
+ if (!--rq->q->flush_data_cnt)
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
}
static void post_flush_end_io(struct request *rq, int error)
@@ -93,13 +94,13 @@ static void post_flush_end_io(struct req
static void init_flush_request(struct request *rq, struct gendisk *disk)
{
rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = WRITE_FLUSH;
+ rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
rq->rq_disk = disk;
}
-static struct request *queue_next_fseq(struct request_queue *q)
+static void queue_next_fseq(struct request_queue *q)
{
- struct request *orig_rq = q->orig_flush_rq;
+ struct request *orig_rq = list_entry_rq(q->running_flushes.next);
struct request *rq = &q->flush_rq;
blk_rq_init(q, rq);
@@ -110,17 +111,18 @@ static struct request *queue_next_fseq(s
rq->end_io = pre_flush_end_io;
break;
case QUEUE_FSEQ_DATA:
- init_request_from_bio(rq, orig_rq->bio);
- /*
- * orig_rq->rq_disk may be different from
- * bio->bi_bdev->bd_disk if orig_rq got here through
- * remapping drivers. Make sure rq->rq_disk points
- * to the same one as orig_rq.
- */
- rq->rq_disk = orig_rq->rq_disk;
- rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
- rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
- rq->end_io = flush_data_end_io;
+ list_for_each_entry_reverse(rq, &q->running_flushes,
+ flush_list) {
+ WARN_ON_ONCE(rq->end_io || rq->end_io_data);
+
+ if (!blk_rq_sectors(rq))
+ continue;
+
+ rq->cmd_flags |= REQ_FLUSH_SEQ;
+ rq->end_io = flush_data_end_io;
+ list_add(&rq->queuelist, &q->queue_head);
+ q->flush_data_cnt++;
+ }
break;
case QUEUE_FSEQ_POSTFLUSH:
init_flush_request(rq, orig_rq->rq_disk);
@@ -131,64 +133,80 @@ static struct request *queue_next_fseq(s
}
elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- return rq;
}
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
- unsigned int fflags = q->flush_flags; /* may change, cache it */
- bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
- bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
- bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
- unsigned skip = 0;
+ unsigned int policy = 0;
+
+ if (fflags & REQ_FLUSH) {
+ if (rq->cmd_flags & REQ_FLUSH)
+ policy |= QUEUE_FSEQ_PREFLUSH;
+ if (blk_rq_sectors(rq))
+ policy |= QUEUE_FSEQ_DATA;
+ if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+ policy |= QUEUE_FSEQ_POSTFLUSH;
+ }
+ return policy;
+}
+
+void blk_insert_flush(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ unsigned int fflags = q->flush_flags;
/*
- * Special case. If there's data but flush is not necessary,
- * the request can be issued directly.
+ * If there's data but flush is not necessary, the request can be
+ * issued directly.
*
* Flush w/o data should be able to be issued directly too but
* currently some drivers assume that rq->bio contains
* non-zero data if it isn't NULL and empty FLUSH requests
* getting here usually have bio's without data.
*/
- if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
+ if (blk_flush_policy(fflags, rq) &
+ (QUEUE_FSEQ_PREFLUSH | QUEUE_FSEQ_POSTFLUSH)) {
+ INIT_LIST_HEAD(&rq->flush_list);
+ list_add_tail(&rq->flush_list, &q->pending_flushes);
+ } else {
rq->cmd_flags &= ~REQ_FLUSH;
- if (!has_fua)
+ if (!(fflags & REQ_FUA))
rq->cmd_flags &= ~REQ_FUA;
- return rq;
+ list_add(&rq->queuelist, &q->queue_head);
}
+}
- /*
- * Sequenced flushes can't be processed in parallel. If
- * another one is already in progress, queue for later
- * processing.
- */
- if (q->flush_seq) {
- list_move_tail(&rq->queuelist, &q->pending_flushes);
- return NULL;
- }
+bool blk_do_flush(struct request_queue *q)
+{
+ unsigned int fflags = q->flush_flags; /* may change, cache it */
+ unsigned int skip = QUEUE_FSEQ_ACTIONS;
+ struct request *rq;
- /*
- * Start a new flush sequence
- */
+ /* sequenced flushes can't be processed in parallel */
+ if (q->flush_seq)
+ return false;
+
+ /* start a new flush sequence */
+ q->flush_data_cnt = 0;
q->flush_err = 0;
q->flush_seq |= QUEUE_FSEQ_STARTED;
- /* adjust FLUSH/FUA of the original request and stash it away */
- rq->cmd_flags &= ~REQ_FLUSH;
- if (!has_fua)
- rq->cmd_flags &= ~REQ_FUA;
- blk_dequeue_request(rq);
- q->orig_flush_rq = rq;
-
- /* skip unneded sequences and return the first one */
- if (!do_preflush)
- skip |= QUEUE_FSEQ_PREFLUSH;
- if (!blk_rq_sectors(rq))
- skip |= QUEUE_FSEQ_DATA;
- if (!do_postflush)
- skip |= QUEUE_FSEQ_POSTFLUSH;
- return blk_flush_complete_seq(q, skip, 0);
+ /*
+ * Collect what needs to be done, adjust REQ_FLUSH/FUA and stash
+ * away the original flush requests.
+ */
+ list_splice_init(&q->pending_flushes, &q->running_flushes);
+ list_for_each_entry(rq, &q->running_flushes, flush_list) {
+ skip &= ~blk_flush_policy(fflags, rq);
+
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!(fflags & REQ_FUA))
+ rq->cmd_flags &= ~REQ_FUA;
+ }
+
+ /* skip unneded sequences and queue the first one */
+ blk_flush_complete_seq(q, skip, 0);
+ return !q->flush_seq;
}
static void bio_end_flush(struct bio *bio, int err)
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,21 +51,20 @@ static inline void blk_clear_rq_complete
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+bool blk_do_flush(struct request_queue *q);
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->pending_flushes) && blk_do_flush(q))
+ continue;
+
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- rq == &q->flush_rq)
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struc
#define ELEVATOR_INSERT_BACK 2
#define ELEVATOR_INSERT_SORT 3
#define ELEVATOR_INSERT_REQUEUE 4
+#define ELEVATOR_INSERT_FLUSH 5
/*
* return values from elevator_may_queue_fn
Index: work/include/linux/blk_types.h
===================================================================
--- work.orig/include/linux/blk_types.h
+++ work/include/linux/blk_types.h
@@ -148,6 +148,7 @@ enum rq_flag_bits {
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_COPY_USER, /* contains copies of user pages */
__REQ_FLUSH, /* request for cache flush */
+ __REQ_FLUSH_SEQ, /* request for flush sequence */
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_SECURE, /* secure discard (used with __REQ_DISCARD) */
@@ -188,6 +189,7 @@ enum rq_flag_bits {
#define REQ_ALLOCED (1 << __REQ_ALLOCED)
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
#define REQ_FLUSH (1 << __REQ_FLUSH)
+#define REQ_FLUSH_SEQ (1 << __REQ_FLUSH_SEQ)
#define REQ_IO_STAT (1 << __REQ_IO_STAT)
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
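An aside on the blk_do_flush() hunk above: it starts from a skip mask with every
flush step set and clears a step as soon as any queued request needs it, so the
batch performs only the union of the required steps. A minimal standalone sketch
of that idea (hypothetical names and types, not the kernel structures):

enum { STEP_PREFLUSH = 1 << 0, STEP_DATA = 1 << 1, STEP_POSTFLUSH = 1 << 2 };

/* policies[i] is the set of steps request i needs (cf. blk_flush_policy). */
static unsigned int steps_to_skip(const unsigned int *policies, int n)
{
	unsigned int skip = STEP_PREFLUSH | STEP_DATA | STEP_POSTFLUSH;
	int i;

	for (i = 0; i < n; i++)
		skip &= ~policies[i];	/* any requester needing a step un-skips it */
	return skip;
}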
2011/1/16 Tejun Heo <[email protected]>:
> Hello,
>
> On Fri, Jan 14, 2011 at 09:00:22AM +0800, Shaohua Li wrote:
>> It appears we can easily implement this in blk_do_flush; I had
>> something at hand too, which passed testing but I have no data yet.
>
>> Task1: ...FS.....C
>> Task2: ....F......S...C
>> Task3: ......F.........S..C
>> F means a flush is queued, S means a flush is dispatched, and C means the
>> flush is completed. In the above, when Task2's flush completes, we can
>> actually mark Task3's flush complete as well, because Task2's flush was
>> dispatched after Task3's flush was queued. For the same reason, we can't
>> merge Task2's and Task3's flushes with Task1's.
>
> I think this is the correct direction but we can take it further. The
> only thing the block layer has to guarantee is (ignoring FUA support):
>
> * If the flush request is empty, at least one flush is executed upon
> its completion.
>
> * If the flush request is not empty, the payload is written only after
> a preceding flush and followed by another flush if requested.
>
> So, if we can combine N flushes with or without data,
>
> PREFLUSH -> N flush data payloads -> POSTFLUSH
This makes sense. Would it be possible for the N flush data payloads to delay
the first request?
> The following is something I hacked together over the past few hours. I just
> made it compile, so it's definitely horridly broken. Please don't try to even
> test it, but it should still give an idea about the direction. I'll
> try to finish this early next week.
I'm looking forward to seeing it.
Thanks,
Shaohua
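To make the merging rule in the Task1/Task2/Task3 diagram above concrete: a
completed flush covers every waiter that was queued before that flush was
dispatched. A minimal standalone sketch, using hypothetical names rather than
the actual request_queue fields:

#include <stdbool.h>

struct flush_waiter {
	unsigned int serial_when_queued;	/* flushes dispatched so far at queue time */
};

/*
 * dispatched_serial is the value a flush was stamped with when it was sent
 * to the device.  Task3 in the diagram is covered by Task2's flush because
 * that flush was dispatched after Task3 queued; Task1's flush is not, since
 * it was dispatched before Task2 and Task3 queued theirs.
 */
static bool flush_covers_waiter(unsigned int dispatched_serial,
				const struct flush_waiter *w)
{
	return dispatched_serial > w->serial_when_queued;
}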
On Sat, Jan 15, 2011 at 02:33:44PM +0100, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 13, 2011 at 10:59:46AM -0800, Darrick J. Wong wrote:
> > > But wouldn't it be better to implement this directly in the flush
> > > machinery instead of as a blkdev_issue_flush() wrapper? We have all the
> > > information at the request queue level so we can easily detect whether
> > > flushes can be merged or not and whether something is issued by
> > > blkdev_issue_flush() or by directly submitting bio wouldn't matter at
> > > all. Am I missing something?
> >
> > Could you clarify what exactly you meant by "in the flush
> > machinery"? I think what you're suggesting is that
> > blk_flush_complete_seq could scan the pending flush list to find a
> > run of consecutive pure flush requests, submit the last one, and set
> > things up so that during the completion of the flush, all the
> > requests in that run are returned with the real flush's return code.
>
> Yeah, something along that line.
>
> > If that's what you meant, then yes, it could be done that way.
> > However, I have a few reasons for choosing the blkdev_issue_flush
> > approach. First, as far as I could tell in the kernel, all the code
> > paths that involve upper layers issuing pure flushes go through
> > blkdev_issue_flush, so it seems like a convenient place to capture
> > all the incoming pure flushes.
>
> That means it _can_ be done there but doesn't mean it's the best spot.
>
> > Other parts of the kernel issue writes with the flush flag set, but
> > we want those to go through the regular queuing mechanisms anyway.
> > Second, with the proposed patch, any upper-layer code that has a
> > requirement for a pure flush to be issued without coordination can
> > do so by submitting the bio directly (or blkdev_issue_flush_now can
> > be exported).
>
> I don't think anyone would need such a capability, but even if someone
> does, that should be implemented as a separate explicit
> DONT_MERGE_THIS_FLUSH flag, not by exploiting subtle inconsistencies in
> the wrapper interface.
>
> > Third, baking the coordination into the flush machinery makes that
> > machinery more complicated to understand and maintain.
>
> And the third point is completely nuts. That's like saying that putting
> related things together makes the concentrated code more complex, so
> let's scatter things all over the place. No, it's not making the code
> more complex. It's putting the complexity where it belongs. It
> decreases maintenance overhead by increasing consistency. Not only
> that, it even results in visibly more consistent and sane _behavior_,
> and the said complex machinery is less than 300 lines long.
>
> > So in short, I went with the blkdev_issue_flush approach because the
> > code is easier to understand, and it's not losing any pure flushes
> > despite casting a narrower net.
>
> No, not at all. You're adding unobvious logic to a wrapper
> interface, creating inconsistent behavior and drawing artificial
> distinctions between pure and non-pure flushes, and between flushes
> issued by bio and by the wrapper interface. Come on. Think about it
> again. You can't seriously be standing by the above rationales.
Since you're the primary author of that file anyway, I'll defer to your
experience. I can squeeze in a few test runs of whatever patches you propose
to the mailing list.
--D
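For what it's worth, the "scan for a run of consecutive pure flushes" idea from
the discussion above could look roughly like the following standalone sketch
(simplified list type and hypothetical names, not the actual request_queue
code): walk the pending list, collect the leading run of data-less flushes, and
complete all of them with the return code of the one flush actually issued.

#include <stdbool.h>
#include <stddef.h>

struct pending_req {
	struct pending_req *next;
	bool has_data;		/* false means a pure (data-less) flush */
};

/*
 * Count the pure flushes at the head of the pending list; they can all
 * share one flush command.  *last_out is the request whose flush would
 * actually be issued.
 */
static size_t leading_pure_flush_run(struct pending_req *head,
				     struct pending_req **last_out)
{
	size_t n = 0;
	struct pending_req *r;

	*last_out = NULL;
	for (r = head; r && !r->has_data; r = r->next) {
		*last_out = r;
		n++;
	}
	return n;
}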
Hello,
On Tue, Jan 18, 2011 at 09:12:55AM +0800, Shaohua Li wrote:
> This makes sense. Would it be possible for the N flush data payloads to delay
> the first request?
I ended up with a different design. Still buggy. It triggers a weird
oops under stress test but other than that things generally seem to
work as expected. Please read the comment at the top of blk-flush.c
for more info. I'll post properly after more testing and debugging.
Thanks.
Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -149,39 +149,29 @@ EXPORT_SYMBOL(blk_rq_init);
static void req_bio_endio(struct request *rq, struct bio *bio,
unsigned int nbytes, int error)
{
- struct request_queue *q = rq->q;
+ if (error)
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+ error = -EIO;
- if (&q->flush_rq != rq) {
- if (error)
- clear_bit(BIO_UPTODATE, &bio->bi_flags);
- else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
- error = -EIO;
-
- if (unlikely(nbytes > bio->bi_size)) {
- printk(KERN_ERR "%s: want %u bytes done, %u left\n",
- __func__, nbytes, bio->bi_size);
- nbytes = bio->bi_size;
- }
+ if (unlikely(nbytes > bio->bi_size)) {
+ printk(KERN_ERR "%s: want %u bytes done, %u left\n",
+ __func__, nbytes, bio->bi_size);
+ nbytes = bio->bi_size;
+ }
- if (unlikely(rq->cmd_flags & REQ_QUIET))
- set_bit(BIO_QUIET, &bio->bi_flags);
+ if (unlikely(rq->cmd_flags & REQ_QUIET))
+ set_bit(BIO_QUIET, &bio->bi_flags);
- bio->bi_size -= nbytes;
- bio->bi_sector += (nbytes >> 9);
+ bio->bi_size -= nbytes;
+ bio->bi_sector += (nbytes >> 9);
- if (bio_integrity(bio))
- bio_integrity_advance(bio, nbytes);
+ if (bio_integrity(bio))
+ bio_integrity_advance(bio, nbytes);
- if (bio->bi_size == 0)
- bio_endio(bio, error);
- } else {
- /*
- * Okay, this is the sequenced flush request in
- * progress, just record the error;
- */
- if (error && !q->flush_err)
- q->flush_err = error;
- }
+ /* don't actually finish bio if it's part of flush sequence */
+ if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+ bio_endio(bio, error);
}
void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -540,7 +530,9 @@ struct request_queue *blk_alloc_queue_no
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
- INIT_LIST_HEAD(&q->pending_flushes);
+ INIT_LIST_HEAD(&q->flush_queue[0]);
+ INIT_LIST_HEAD(&q->flush_queue[1]);
+ INIT_LIST_HEAD(&q->flush_data_in_flight);
INIT_WORK(&q->unplug_work, blk_unplug_work);
kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1219,7 +1211,7 @@ static int __make_request(struct request
spin_lock_irq(q->queue_lock);
if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
- where = ELEVATOR_INSERT_FRONT;
+ where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
}
@@ -1804,7 +1796,7 @@ static void blk_account_io_done(struct r
* normal IO on queueing nor completion. Accounting the
* containing request is enough.
*/
- if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
+ if (blk_do_io_stat(req) && !(req->cmd_flags & REQ_FLUSH_SEQ)) {
unsigned long duration = jiffies - req->start_time;
const int rw = rq_data_dir(req);
struct hd_struct *part;
Index: work/block/blk-flush.c
===================================================================
--- work.orig/block/blk-flush.c
+++ work/block/blk-flush.c
@@ -1,6 +1,66 @@
/*
* Functions to sequence FLUSH and FUA writes.
+ *
+ * Copyright (C) 2011 Max Planck Institute for Gravitational Physics
+ * Copyright (C) 2011 Tejun Heo <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * REQ_{FLUSH|FUA} requests are decomposed to sequences composed of three
+ * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
+ * properties and hardware capability.
+ *
+ * If a request doesn't have data, only REQ_FLUSH makes sense, which
+ * indicates a simple flush request. If there is data, REQ_FLUSH indicates
+ * that the device cache should be flushed before the data is executed, and
+ * REQ_FUA means that the data must be on non-volatile media on request
+ * completion.
+ *
+ * If the device doesn't have writeback cache, FLUSH and FUA don't make any
+ * difference. The requests are either completed immediately if there's no
+ * data or executed as normal requests.
+ *
+ * If the device has writeback cache and supports FUA, REQ_FLUSH is
+ * translated to PREFLUSH but REQ_FUA is passed down directly with the data
+ * request.
+ *
+ * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
+ * translated to PREFLUSH and REQ_FUA to POSTFLUSH.
+ *
+ * The actual execution of flush is double buffered. Whenever a request
+ * needs to execute PRE or POSTFLUSH, it queues at
+ * q->flush_queue[q->flush_pending_idx]. Once certain criteria are met,
+ * flush is issued and pending_idx is toggled. When the flush completes,
+ * all the requests which were pending proceed to the next step.
+ * This allows near optimal merging of flush requests.
+ *
+ * Currently, the following conditions are used to determine when to issue
+ * flush.
+ *
+ * C1. At any given time, only one flush shall be in progress. This makes
+ * double buffering sufficient.
+ *
+ * C2. Flush is not deferred if any request is executing the data part of
+ * its sequence. This avoids issuing separate POSTFLUSHes for requests
+ * which shared PREFLUSH.
+ *
+ * C3. The second condition is ignored if there is a request which has
+ * waited longer than FLUSH_PENDING_TIMEOUT. This is to avoid
+ * starvation in the unlikely case where there is a continuous stream of
+ * FUA (without FLUSH) requests.
+ *
+ * Note that a sequenced FLUSH/FUA request with data is completed twice.
+ * Once while executing the data part and again after the whole sequence is
+ * complete. The first completion updates the contained bio but doesn't
+ * finish it so that the bio submitter is notified only after the whole
+ * sequence is complete. This is implemented by testing REQ_FLUSH_SEQ in
+ * req_bio_endio().
+ *
+ * The above peculiarity requires that each FLUSH/FUA request has only one
+ * bio attached to it, which is guaranteed as they aren't allowed to be
+ * merged in the usual way.
*/
+
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bio.h>
@@ -11,184 +71,289 @@
/* FLUSH/FUA sequences */
enum {
- QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
- QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_FSEQ_DONE = (1 << 4),
+ REQ_FSEQ_PREFLUSH = (1 << 0), /* pre-flushing in progress */
+ REQ_FSEQ_DATA = (1 << 1), /* data write in progress */
+ REQ_FSEQ_POSTFLUSH = (1 << 2), /* post-flushing in progress */
+ REQ_FSEQ_DONE = (1 << 3),
+
+ REQ_FSEQ_ACTIONS = REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA |
+ REQ_FSEQ_POSTFLUSH,
+
+ /*
+ * If flush has been pending longer than the following timeout,
+ * it's issued even if flush_data requests are still in flight.
+ */
+ FLUSH_PENDING_TIMEOUT = 5 * HZ,
};
-static struct request *queue_next_fseq(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q);
-unsigned blk_flush_cur_seq(struct request_queue *q)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
- if (!q->flush_seq)
- return 0;
- return 1 << ffz(q->flush_seq);
+ unsigned int policy = 0;
+
+ if (fflags & REQ_FLUSH) {
+ if (rq->cmd_flags & REQ_FLUSH)
+ policy |= REQ_FSEQ_PREFLUSH;
+ if (blk_rq_sectors(rq))
+ policy |= REQ_FSEQ_DATA;
+ if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+ policy |= REQ_FSEQ_POSTFLUSH;
+ }
+ return policy;
}
-static struct request *blk_flush_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+static unsigned int blk_flush_cur_seq(struct request *rq)
{
- struct request *next_rq = NULL;
+ return 1 << ffz(rq->flush.seq);
+}
- if (error && !q->flush_err)
- q->flush_err = error;
+static void blk_flush_restore_request(struct request *rq)
+{
+ /*
+ * After flush data completion, @rq->bio is %NULL but we need to
+ * complete the bio again. @rq->biotail is guaranteed to equal the
+ * original @rq->bio. Restore it.
+ */
+ rq->bio = rq->biotail;
- BUG_ON(q->flush_seq & seq);
- q->flush_seq |= seq;
+ /* make @rq a normal request */
+ rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+ rq->end_io = NULL;
+}
- if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
- /* not complete yet, queue the next flush sequence */
- next_rq = queue_next_fseq(q);
- } else {
- /* complete this flush request */
- __blk_end_request_all(q->orig_flush_rq, q->flush_err);
- q->orig_flush_rq = NULL;
- q->flush_seq = 0;
+/**
+ * blk_flush_complete_seq - complete flush sequence
+ * @rq: FLUSH/FUA request being sequenced
+ * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
+ * @error: whether an error occurred
+ *
+ * @rq just completed @seq part of its flush sequence, record the
+ * completion and trigger the next step.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if a request was added to the dispatch queue, %false otherwise.
+ */
+static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
+ int error)
+{
+ struct request_queue *q = rq->q;
+ struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+ bool queued = false;
+
+ BUG_ON(rq->flush.seq & seq);
+ rq->flush.seq |= seq;
+
+ if (likely(!error))
+ seq = blk_flush_cur_seq(rq);
+ else
+ seq = REQ_FSEQ_DONE;
+
+ switch (seq) {
+ case REQ_FSEQ_PREFLUSH:
+ case REQ_FSEQ_POSTFLUSH:
+ /* queue for flush */
+ if (list_empty(pending))
+ q->flush_pending_since = jiffies;
+ list_move_tail(&rq->flush.list, pending);
+ break;
- /* dispatch the next flush if there's one */
- if (!list_empty(&q->pending_flushes)) {
- next_rq = list_entry_rq(q->pending_flushes.next);
- list_move(&next_rq->queuelist, &q->queue_head);
- }
+ case REQ_FSEQ_DATA:
+ list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+ list_add(&rq->queuelist, &q->queue_head);
+ queued = true;
+ break;
+
+ case REQ_FSEQ_DONE:
+ /*
+ * @rq was previously adjusted by blk_flush_issue() for
+ * flush sequencing and may already have gone through the
+ * flush data request completion path. Restore @rq for
+ * normal completion and end it.
+ */
+ BUG_ON(!list_empty(&rq->queuelist));
+ list_del_init(&rq->flush.list);
+ blk_flush_restore_request(rq);
+ __blk_end_request_all(rq, error);
+ break;
+
+ default:
+ BUG();
}
- return next_rq;
+
+ return blk_kick_flush(q) | queued;
}
-static void blk_flush_complete_seq_end_io(struct request_queue *q,
- unsigned seq, int error)
+static void flush_end_io(struct request *flush_rq, int error)
{
+ struct request_queue *q = flush_rq->q;
+ struct list_head *running = &q->flush_queue[q->flush_running_idx];
bool was_empty = elv_queue_empty(q);
- struct request *next_rq;
+ bool queued = false;
+ struct request *rq, *n;
- next_rq = blk_flush_complete_seq(q, seq, error);
+ BUG_ON(q->flush_pending_idx == q->flush_running_idx);
- /*
- * Moving a request silently to empty queue_head may stall the
- * queue. Kick the queue in those cases.
- */
- if (was_empty && next_rq)
+ /* account completion of the flush request */
+ q->flush_running_idx ^= 1;
+ elv_completed_request(q, flush_rq);
+
+ /* and push the waiting requests to the next stage */
+ list_for_each_entry_safe(rq, n, running, flush.list) {
+ unsigned int seq = blk_flush_cur_seq(rq);
+
+ BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+ queued |= blk_flush_complete_seq(rq, seq, error);
+ }
+
+ /* after populating an empty queue, kick it to avoid stall */
+ if (queued && was_empty)
__blk_run_queue(q);
}
-static void pre_flush_end_io(struct request *rq, int error)
+/**
+ * blk_kick_flush - consider issuing flush request
+ * @q: request_queue being kicked
+ *
+ * Flush related states of @q have changed, consider issuing flush request.
+ * Please read the comment at the top of this file for more info.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if flush was issued, %false otherwise.
+ */
+static bool blk_kick_flush(struct request_queue *q)
{
- elv_completed_request(rq->q, rq);
- blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+ struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+ struct request *first_rq =
+ list_first_entry(pending, struct request, flush.list);
+
+ /* C1 described at the top of this file */
+ if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
+ return false;
+
+ /* C2 and C3 */
+ if (!list_empty(&q->flush_data_in_flight) &&
+ time_before(jiffies,
+ q->flush_pending_since + FLUSH_PENDING_TIMEOUT))
+ return false;
+
+ /*
+ * Issue flush and toggle pending_idx. This makes pending_idx
+ * different from running_idx, which means flush is in flight.
+ */
+ blk_rq_init(q, &q->flush_rq);
+ q->flush_rq.cmd_type = REQ_TYPE_FS;
+ q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+ q->flush_rq.rq_disk = first_rq->rq_disk;
+ q->flush_rq.end_io = flush_end_io;
+
+ q->flush_pending_idx ^= 1;
+ elv_insert(q, &q->flush_rq, ELEVATOR_INSERT_FRONT);
+ return true;
}
static void flush_data_end_io(struct request *rq, int error)
{
- elv_completed_request(rq->q, rq);
- blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
-}
+ bool was_empty = elv_queue_empty(rq->q);
-static void post_flush_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+ /* after populating an empty queue, kick it to avoid stall */
+ if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+ __blk_run_queue(rq->q);
}
-static void init_flush_request(struct request *rq, struct gendisk *disk)
+/**
+ * blk_insert_flush - insert a new FLUSH/FUA request
+ * @rq: request to insert
+ *
+ * To be called from elv_insert() for %ELEVATOR_INSERT_FLUSH insertions.
+ * @rq is being submitted. Analyze what needs to be done and put it on the
+ * right queue.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_insert_flush(struct request *rq)
{
- rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = WRITE_FLUSH;
- rq->rq_disk = disk;
-}
+ struct request_queue *q = rq->q;
+ unsigned int fflags = q->flush_flags; /* may change, cache */
+ unsigned int policy = blk_flush_policy(fflags, rq);
-static struct request *queue_next_fseq(struct request_queue *q)
-{
- struct request *orig_rq = q->orig_flush_rq;
- struct request *rq = &q->flush_rq;
+ BUG_ON(rq->end_io);
+ BUG_ON(!rq->bio || rq->bio != rq->biotail);
- blk_rq_init(q, rq);
+ /*
+ * @policy now records what operations need to be done. Adjust
+ * REQ_FLUSH and FUA for the driver.
+ */
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!(fflags & REQ_FUA))
+ rq->cmd_flags &= ~REQ_FUA;
- switch (blk_flush_cur_seq(q)) {
- case QUEUE_FSEQ_PREFLUSH:
- init_flush_request(rq, orig_rq->rq_disk);
- rq->end_io = pre_flush_end_io;
- break;
- case QUEUE_FSEQ_DATA:
- init_request_from_bio(rq, orig_rq->bio);
- /*
- * orig_rq->rq_disk may be different from
- * bio->bi_bdev->bd_disk if orig_rq got here through
- * remapping drivers. Make sure rq->rq_disk points
- * to the same one as orig_rq.
- */
- rq->rq_disk = orig_rq->rq_disk;
- rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
- rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
- rq->end_io = flush_data_end_io;
- break;
- case QUEUE_FSEQ_POSTFLUSH:
- init_flush_request(rq, orig_rq->rq_disk);
- rq->end_io = post_flush_end_io;
- break;
- default:
- BUG();
+ /*
+ * If there's data but flush is not necessary, the request can be
+ * processed directly without going through flush machinery. Queue
+ * for normal execution.
+ */
+ if ((policy & REQ_FSEQ_DATA) &&
+ !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+ list_add(&rq->queuelist, &q->queue_head);
+ return;
}
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- return rq;
+ /*
+ * @rq should go through flush machinery. Mark it part of flush
+ * sequence and submit for further processing.
+ */
+ memset(&rq->flush, 0, sizeof(rq->flush));
+ INIT_LIST_HEAD(&rq->flush.list);
+ rq->cmd_flags |= REQ_FLUSH_SEQ;
+ rq->end_io = flush_data_end_io;
+
+ blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
}
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+/**
+ * blk_abort_flushes - @q is being aborted, abort flush requests
+ * @q: request_queue being aborted
+ *
+ * To be called from elv_abort_queue(). @q is being aborted. Prepare all
+ * FLUSH/FUA requests for abortion.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_abort_flushes(struct request_queue *q)
{
- unsigned int fflags = q->flush_flags; /* may change, cache it */
- bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
- bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
- bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
- unsigned skip = 0;
+ struct request *rq, *n;
+ int i;
/*
- * Special case. If there's data but flush is not necessary,
- * the request can be issued directly.
- *
- * Flush w/o data should be able to be issued directly too but
- * currently some drivers assume that rq->bio contains
- * non-zero data if it isn't NULL and empty FLUSH requests
- * getting here usually have bio's without data.
+ * Requests in flight for data are already owned by the dispatch
+ * queue or the device driver. Just restore for normal completion.
*/
- if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
- rq->cmd_flags &= ~REQ_FLUSH;
- if (!has_fua)
- rq->cmd_flags &= ~REQ_FUA;
- return rq;
+ list_for_each_entry_safe(rq, n, &q->flush_data_in_flight, flush.list) {
+ list_del_init(&rq->flush.list);
+ blk_flush_restore_request(rq);
}
/*
- * Sequenced flushes can't be processed in parallel. If
- * another one is already in progress, queue for later
- * processing.
+ * We need to give away requests on flush queues. Restore for
+ * normal completion and put them on the dispatch queue.
*/
- if (q->flush_seq) {
- list_move_tail(&rq->queuelist, &q->pending_flushes);
- return NULL;
+ for (i = 0; i < ARRAY_SIZE(q->flush_queue); i++) {
+ list_for_each_entry_safe(rq, n, &q->flush_queue[i],
+ flush.list) {
+ list_del_init(&rq->flush.list);
+ blk_flush_restore_request(rq);
+ list_add_tail(&rq->queuelist, &q->queue_head);
+ }
}
-
- /*
- * Start a new flush sequence
- */
- q->flush_err = 0;
- q->flush_seq |= QUEUE_FSEQ_STARTED;
-
- /* adjust FLUSH/FUA of the original request and stash it away */
- rq->cmd_flags &= ~REQ_FLUSH;
- if (!has_fua)
- rq->cmd_flags &= ~REQ_FUA;
- blk_dequeue_request(rq);
- q->orig_flush_rq = rq;
-
- /* skip unneded sequences and return the first one */
- if (!do_preflush)
- skip |= QUEUE_FSEQ_PREFLUSH;
- if (!blk_rq_sectors(rq))
- skip |= QUEUE_FSEQ_DATA;
- if (!do_postflush)
- skip |= QUEUE_FSEQ_POSTFLUSH;
- return blk_flush_complete_seq(q, skip, 0);
}
static void bio_end_flush(struct bio *bio, int err)
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,21 +51,17 @@ static inline void blk_clear_rq_complete
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+void blk_abort_flushes(struct request_queue *q);
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- rq == &q->flush_rq)
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q,
q->elevator->ops->elevator_add_req_fn(q, rq);
break;
+ case ELEVATOR_INSERT_FLUSH:
+ rq->cmd_flags |= REQ_SOFTBARRIER;
+ blk_insert_flush(rq);
+ break;
+
default:
printk(KERN_ERR "%s: bad insertion point %d\n",
__func__, where);
@@ -785,6 +790,8 @@ void elv_abort_queue(struct request_queu
{
struct request *rq;
+ blk_abort_flushes(q);
+
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
rq->cmd_flags |= REQ_QUIET;
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struc
#define ELEVATOR_INSERT_BACK 2
#define ELEVATOR_INSERT_SORT 3
#define ELEVATOR_INSERT_REQUEUE 4
+#define ELEVATOR_INSERT_FLUSH 5
/*
* return values from elevator_may_queue_fn
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -103,9 +103,15 @@ struct request {
*/
union {
struct rb_node rb_node; /* sort/lookup */
- void *completion_data;
+ struct {
+ unsigned int serial;
+ unsigned int seq;
+ struct list_head list;
+ } flush;
};
+ void *completion_data;
+
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it.
@@ -363,11 +369,12 @@ struct request_queue
* for flush operations
*/
unsigned int flush_flags;
- unsigned int flush_seq;
- int flush_err;
+ unsigned int flush_pending_idx:1;
+ unsigned int flush_running_idx:1;
+ unsigned long flush_pending_since;
+ struct list_head flush_queue[2];
+ struct list_head flush_data_in_flight;
struct request flush_rq;
- struct request *orig_flush_rq;
- struct list_head pending_flushes;
struct mutex sysfs_lock;
Index: work/include/linux/blk_types.h
===================================================================
--- work.orig/include/linux/blk_types.h
+++ work/include/linux/blk_types.h
@@ -148,6 +148,7 @@ enum rq_flag_bits {
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_COPY_USER, /* contains copies of user pages */
__REQ_FLUSH, /* request for cache flush */
+ __REQ_FLUSH_SEQ, /* request for flush sequence */
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_SECURE, /* secure discard (used with __REQ_DISCARD) */
@@ -188,6 +189,7 @@ enum rq_flag_bits {
#define REQ_ALLOCED (1 << __REQ_ALLOCED)
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
#define REQ_FLUSH (1 << __REQ_FLUSH)
+#define REQ_FLUSH_SEQ (1 << __REQ_FLUSH_SEQ)
#define REQ_IO_STAT (1 << __REQ_IO_STAT)
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
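A rough standalone model of the double-buffered flush queues described in the
blk-flush.c comment in the patch above (only condition C1 is modeled; the names
and types are hypothetical, not the kernel fields):

#include <stdbool.h>

struct flush_state {
	unsigned int pending_idx:1;	/* where new PRE/POSTFLUSH waiters queue */
	unsigned int running_idx:1;	/* index of the batch being flushed */
	int queued[2];			/* stand-in for flush_queue[2] */
};

/* C1: only one flush in flight, so issue only when the indices agree. */
static bool kick_flush(struct flush_state *fs)
{
	if (fs->pending_idx != fs->running_idx || !fs->queued[fs->pending_idx])
		return false;		/* flush in flight or nothing pending */
	fs->pending_idx ^= 1;		/* new arrivals go to the other list */
	return true;			/* caller issues the actual flush */
}

/* On completion, the flushed batch moves on and the indices match again. */
static void flush_done(struct flush_state *fs)
{
	fs->queued[fs->running_idx] = 0;
	fs->running_idx ^= 1;
}

Usage follows the pattern in the patch: waiters accumulate on the pending list,
kick_flush() decides whether to send a flush and toggles pending_idx, and
flush_done() plays the role of flush_end_io() by retiring the running batch.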
On Thu, Jan 20, 2011 at 07:50:57PM +0100, Tejun Heo wrote:
> Hello,
>
> On Tue, Jan 18, 2011 at 09:12:55AM +0800, Shaohua Li wrote:
> > This makes sense. Would it be possible for the N flush data payloads to delay
> > the first request?
>
> I ended up with a different design. Still buggy. It triggers a weird
> oops under stress test but other than that things generally seem to
> work as expected. Please read the comment at the top of blk-flush.c
> for more info. I'll post properly after more testing and debugging.
The oops was caused by debugging code I put in (not posted together).
It runs fine without it, so if you have some time, please give it a
spin.
Thanks.
--
tejun
On Thu, Jan 20, 2011 at 08:13:41PM +0100, Tejun Heo wrote:
> On Thu, Jan 20, 2011 at 07:50:57PM +0100, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jan 18, 2011 at 09:12:55AM +0800, Shaohua Li wrote:
> > > This makes sense. Would it be possible for the N flush data payloads to delay
> > > the first request?
> >
> > I ended up with a different design. Still buggy. It triggers a weird
> > oops under stress test but other than that things generally seem to
> > work as expected. Please read the comment at the top of blk-flush.c
> > for more info. I'll post properly after more testing and debugging.
>
> The oops was caused by debugging code I put in (not posted together).
> It runs fine without it, so if you have some time, please give it a
> spin.
Seems to run without incident on my storage test arrays. The fsync-happy
performance numbers are comparable to previous iterations of flush patches.
Looks good, thank you!
--D