2010-11-29 22:05:46

by djwong

[permalink] [raw]
Subject: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time. Typically, these are simple storage systems with write cache
enabled and no battery to save that cache after a power failure. When we
encounter a system with many I/O threads that write data and then call fsync
after more transactions accumulate, ext4_sync_file performs a data-only flush,
the performance of which is suboptimal because each of those threads issues its
own flush command to the drive instead of trying to coordinate the flush,
thereby wasting execution time.

Instead of each fsync call initiating its own flush, there's now a flag to
indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a flush.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are any
other threads that are close behind in ext4_sync_file. After that wait, the
state transitions to 2 and the flush is issued. Once that's done, the state
goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a flush themselves, they simply
wait for that first thread to do it for them. If they see that the flag is 2,
they wait for the current flush to finish, and start over.

However, there are a couple of exceptions to this rule. First, there exist
high-end storage arrays with battery-backed write caches for which flush
commands take very little time (< 2ms); on these systems, performing the
coordination actually lowers performance. Given the earlier patch to the block
layer to report low-level device flush times, we can detect this situation and
have all threads issue flushes without coordinating, as we did before. The
second case is when there's a single thread issuing flushes, in which case it
can skip the coordination.

This author of this patch is aware that jbd2 has a similar flush coordination
scheme for journal commits. An earlier version of this patch simply created a
new empty journal transaction and committed it, but that approach was shown to
increase the amount of write traffic heading towards the disk, which in turn
lowered performance considerably, especially in the case where directio was in
use. Therefore, this patch adds the coordination code directly to ext4.

To test the performance and safety of this patchset, I crafted an ffsb profile
named fsync-happy that performs a bunch of disk I/O with periodic fsync()s to
flush the data out to disk. Performance results can be seen here:
http://bit.ly/fYAclV

The data presented in blue text represent results obtained on high performance
disk arrays that have battery-backed write cache enabled. Red results on the
"speed differences" page represent performance regressions, of course.
Descriptions of the disk hardware tested are on the rightmost page. In no case
were any of the benchmarks CPU-bound.

The speed differences page shows some interesting results. Before Tejun Heo's
barrier -> flush conversion in 2.6.37-rc1, we saw that enabling barriers caused
between a 30-80 percent performance regression on a fairly large variety of
test programs; generally, the more fsyncs, the bigger the drop; if one never
fsyncs any data, the only flushes that ever happen are during the periodic
journal commits. Now we see that the cost of enabling flushes in ext4 on the
fsync-happy workload has dropped from about 80 percent to about 25-30 percent.
With this fsync coordination patch, that drop becomes about 5-14 percent.

I see some small performance (< 1 percent) regressions for some hardware. This is
generally acceptable because I see larger variances from repeatedly running
fsync-happy. The two larger regressions (elm3a4_ipr_nowc and
elm3c44_sata_nowc) are a somewhat questionable case because those two disks
have no write cache yet ext4 was not properly detecting this and setting
barrier=0. That bug will be addressed separately.

In terms of data safety, I've been performing power failure testing with a
bunch of blades that have slow IDE disks with fairly large write caches. So
far I haven't seen any more FS errors after reset than I see with 2.6.36.

This patchset consists of four patches. The first adds to the block layer the
ability to measure the amount of time it takes for a lower-level block device
to issue and complete a flush command. The middle two patches add to md and
dm, respectively, the ability to report the component devices' flush times.
For 2.6.37-rc3, the md patch also requires my earlier patch to md to enable
REQ_FLUSH support. The fourth patch adds the auto-tuning fsync flush
coordination to ext4.

To everyone who has reviewed this patch set so far, thank you for your help!
As usual, I welcome any questions or comments.

--D


2010-11-29 22:05:55

by djwong

[permalink] [raw]
Subject: [PATCH 1/4] block: Measure flush round-trip times and report average value

This patch adds to the block layer the ability to intercept flush requests for
the purpose of measuring the round-trip times of the write-cache flush command
on whatever hardware sits underneath the block layer. The eventual intent of
this patch is for higher-level clients (filesystems) to be able to measure
flush times to tune their use of them.

Signed-off-by: Darrick J. Wong <[email protected]>
---
block/blk-core.c | 13 +++++++++++++
block/genhd.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/bio.c | 11 +++++++++++
include/linux/blk_types.h | 2 ++
include/linux/blkdev.h | 2 ++
include/linux/genhd.h | 23 +++++++++++++++++++++++
6 files changed, 90 insertions(+), 0 deletions(-)


diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..233573a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1540,6 +1540,9 @@ static inline void __generic_make_request(struct bio *bio)

trace_block_bio_queue(q, bio);

+ if (bio->bi_rw & REQ_FLUSH)
+ bio->flush_start_time_ns = ktime_to_ns(ktime_get());
+
ret = q->make_request_fn(q, bio);
} while (ret);

@@ -1954,6 +1957,9 @@ void blk_start_request(struct request *req)
req->next_rq->resid_len = blk_rq_bytes(req->next_rq);

blk_add_timer(req);
+
+ if (req->cmd_flags & REQ_FLUSH)
+ req->flush_start_time_ns = ktime_to_ns(ktime_get());
}
EXPORT_SYMBOL(blk_start_request);

@@ -2182,6 +2188,13 @@ EXPORT_SYMBOL_GPL(blk_unprep_request);
*/
static void blk_finish_request(struct request *req, int error)
{
+ if (!error && req->cmd_flags & REQ_FLUSH) {
+ u64 delta;
+
+ delta = ktime_to_ns(ktime_get()) - req->flush_start_time_ns;
+ update_flush_times(req->rq_disk, delta);
+ }
+
if (blk_rq_tagged(req))
blk_queue_end_tag(req->q, req);

diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..465bd00 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -538,6 +538,9 @@ void add_disk(struct gendisk *disk)
*/
disk->major = MAJOR(devt);
disk->first_minor = MINOR(devt);
+ spin_lock_init(&disk->flush_time_lock);
+ disk->avg_flush_time_ns = 0;
+ disk->flush_samples = 0;

/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
@@ -824,6 +827,37 @@ static ssize_t disk_range_show(struct device *dev,
return sprintf(buf, "%d\n", disk->minors);
}

+static ssize_t disk_flush_samples_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return sprintf(buf, "%llu\n", disk->flush_samples);
+}
+
+static ssize_t disk_avg_flush_time_ns_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return sprintf(buf, "%llu\n", disk->avg_flush_time_ns);
+}
+
+static ssize_t disk_avg_flush_time_ns_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ spin_lock(&disk->flush_time_lock);
+ disk->avg_flush_time_ns = 0;
+ disk->flush_samples = 0;
+ spin_unlock(&disk->flush_time_lock);
+
+ return count;
+}
+
+
static ssize_t disk_ext_range_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -876,6 +910,9 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
}

static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
+static DEVICE_ATTR(avg_flush_time_ns, S_IRUGO | S_IWUSR,
+ disk_avg_flush_time_ns_show, disk_avg_flush_time_ns_store);
+static DEVICE_ATTR(flush_samples, S_IRUGO, disk_flush_samples_show, NULL);
static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
static DEVICE_ATTR(ro, S_IRUGO, disk_ro_show, NULL);
@@ -898,6 +935,8 @@ static struct device_attribute dev_attr_fail_timeout =

static struct attribute *disk_attrs[] = {
&dev_attr_range.attr,
+ &dev_attr_avg_flush_time_ns.attr,
+ &dev_attr_flush_samples.attr,
&dev_attr_ext_range.attr,
&dev_attr_removable.attr,
&dev_attr_ro.attr,
diff --git a/fs/bio.c b/fs/bio.c
index 4bd454f..aab5f27 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1442,11 +1442,22 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
**/
void bio_endio(struct bio *bio, int error)
{
+ struct request_queue *q = NULL;
+
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;

+ if (bio->bi_bdev)
+ q = bdev_get_queue(bio->bi_bdev);
+ if (!(q && q->prep_rq_fn) && !error && bio->bi_rw & REQ_FLUSH) {
+ u64 delta;
+
+ delta = ktime_to_ns(ktime_get()) - bio->flush_start_time_ns;
+ update_flush_times(bio->bi_bdev->bd_disk, delta);
+ }
+
if (bio->bi_end_io)
bio->bi_end_io(bio, error);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..e8c29b9 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,8 @@ struct bio {

bio_destructor_t *bi_destructor; /* destructor */

+ u64 flush_start_time_ns;
+
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aae86fd..357d669 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -163,6 +163,8 @@ struct request {

/* for bidi */
struct request *next_rq;
+
+ u64 flush_start_time_ns;
};

static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7a7b9c1..7097396 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -178,8 +178,31 @@ struct gendisk {
struct blk_integrity *integrity;
#endif
int node_id;
+
+ spinlock_t flush_time_lock;
+ u64 avg_flush_time_ns;
+ u64 flush_samples;
};

+static inline void update_flush_times(struct gendisk *disk, u64 delta)
+{
+ u64 data;
+
+ spin_lock(&disk->flush_time_lock);
+ if (disk->flush_samples < 256)
+ disk->flush_samples++;
+ if (!disk->avg_flush_time_ns)
+ disk->avg_flush_time_ns = delta;
+ else {
+ data = disk->avg_flush_time_ns;
+ data *= 255;
+ data += delta;
+ data /= 256;
+ disk->avg_flush_time_ns = data;
+ }
+ spin_unlock(&disk->flush_time_lock);
+}
+
static inline struct gendisk *part_to_disk(struct hd_struct *part)
{
if (likely(part)) {

2010-11-29 22:06:13

by djwong

[permalink] [raw]
Subject: [PATCH 3/4] dm: Compute average flush time from component devices

For dm devices which are composed of other block devices, a flush is mapped out
to those other block devices. Therefore, the average flush time can be
computed as the average flush time of whichever device flushes most slowly.

Signed-off-by: Darrick J. Wong <[email protected]>
---
drivers/md/dm.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)


diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7cb1352..62aeeb9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -846,12 +846,38 @@ static void start_queue(struct request_queue *q)
spin_unlock_irqrestore(q->queue_lock, flags);
}

+static void measure_flushes(struct mapped_device *md)
+{
+ struct dm_table *t;
+ struct dm_dev_internal *dd;
+ struct list_head *devices;
+ u64 max = 0, samples = 0;
+
+ t = dm_get_live_table(md);
+ devices = dm_table_get_devices(t);
+ list_for_each_entry(dd, devices, list) {
+ if (dd->dm_dev.bdev->bd_disk->avg_flush_time_ns <= max)
+ continue;
+ max = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
+ samples = dd->dm_dev.bdev->bd_disk->flush_samples;
+ }
+ dm_table_put(t);
+
+ spin_lock(&md->disk->flush_time_lock);
+ md->disk->avg_flush_time_ns = max;
+ md->disk->flush_samples = samples;
+ spin_unlock(&md->disk->flush_time_lock);
+}
+
static void dm_done(struct request *clone, int error, bool mapped)
{
int r = error;
struct dm_rq_target_io *tio = clone->end_io_data;
dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;

+ if (clone->cmd_flags & REQ_FLUSH)
+ measure_flushes(tio->md);
+
if (mapped && rq_end_io)
r = rq_end_io(tio->ti, clone, error, &tio->info);

@@ -2310,6 +2336,8 @@ static void dm_wq_work(struct work_struct *work)
if (dm_request_based(md))
generic_make_request(c);
else
+ if (c->bi_rw & REQ_FLUSH)
+ measure_flushes(md);
__split_and_process_bio(md, c);

down_read(&md->io_lock);

2010-11-29 22:06:26

by djwong

[permalink] [raw]
Subject: [PATCH 4/4] ext4: Coordinate data-only flush requests sent by fsync

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time. Typically, these are simple storage systems with write cache
enabled and no battery to save that cache after a power failure. When we
encounter a system with many I/O threads that write data and then call fsync
after more transactions accumulate, ext4_sync_file performs a data-only flush,
the performance of which is suboptimal because each of those threads issues its
own flush command to the drive instead of trying to coordinate the flush,
thereby wasting execution time.

Instead of each fsync call initiating its own flush, there's now a flag to
indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a flush.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are any
other threads that are close behind in ext4_sync_file. After that wait, the
state transitions to 2 and the flush is issued. Once that's done, the state
goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a flush themselves, they simply
wait for that first thread to do it for them. If they see that the flag is 2,
they wait for the current flush to finish, and start over.

However, there are a couple of exceptions to this rule. First, there exist
high-end storage arrays with battery-backed write caches for which flush
commands take very little time (< 2ms); on these systems, performing the
coordination actually lowers performance. Given the earlier patch to the block
layer to report low-level device flush times, we can detect this situation and
have all threads issue flushes without coordinating, as we did before. The
second case is when there's a single thread issuing flushes, in which case it
can skip the coordination.

This author of this patch is aware that jbd2 has a similar flush coordination
scheme for journal commits. An earlier version of this patch simply created a
new empty journal transaction and committed it, but that approach was shown to
increase the amount of write traffic heading towards the disk, which in turn
lowered performance considerably, especially in the case where directio was in
use. Therefore, this patch adds the coordination code directly to ext4.

Should the user need to override the definition of a "fast" flush from the
default 2ms, the fast_flush_ns mount option is provided to do this.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 18 +++++++++++++
fs/ext4/fsync.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/super.c | 23 ++++++++++++++++
3 files changed, 119 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6a5edea..8c111e3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -38,6 +38,17 @@
*/

/*
+ * Flushes under 2ms should disable flush coordination
+ */
+#define DEFAULT_FAST_FLUSH 2000000
+
+enum ext4_flush_state {
+ EXT4_FLUSH_IDLE = 0, /* no flushes going on */
+ EXT4_FLUSH_WAITING, /* coordinating w/ other threads */
+ EXT4_FLUSH_RUNNING, /* flush submitted */
+};
+
+/*
* Define EXT4FS_DEBUG to produce debug messages
*/
#undef EXT4FS_DEBUG
@@ -1198,6 +1209,13 @@ struct ext4_sb_info {
struct ext4_li_request *s_li_request;
/* Wait multiplier for lazy initialization thread */
unsigned int s_li_wait_mult;
+
+ /* fsync flush coordination */
+ spinlock_t flush_flag_lock;
+ enum ext4_flush_state flush_state;
+ struct completion flush_finish;
+ pid_t last_flusher;
+ unsigned long fast_flush_ns;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index c1a7bc9..e3e5a5f 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -141,6 +141,83 @@ static void ext4_sync_parent(struct inode *inode)
}

/*
+ * Handle the case where a process wants to flush writes to disk but there is
+ * no accompanying journal commit (i.e. no metadata to be updated). This can
+ * happen when a first thread writes data, some other threads issue and commit
+ * transactions for other filesystem activity, and then the first writer thread
+ * issues an fsync to flush its dirty data to disk.
+ */
+static int ext4_sync_dataonly(struct inode *inode)
+{
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ struct gendisk *disk;
+ ktime_t expires;
+ pid_t pid;
+ int ret = 0;
+
+ /*
+ * Fast (< 2ms) flushes imply battery-backed write cache or a block
+ * device that silently eat flushes (disk w/o any write cache), which
+ * implies that flushes are no-ops. We also check the calling process;
+ * if it's the same as the previous caller, there's only one process,
+ * and no need to coordinate. Issue the flush instead of wasting time
+ * coordinating no-ops.
+ *
+ * As this is a data-only flush (no metadata writes), we do the flush
+ * coordination here instead of creating and committing an empty
+ * journal transaction, because doing so creates more writes for the
+ * empty journal records.
+ */
+ pid = current->pid;
+ disk = inode->i_sb->s_bdev->bd_disk;
+ spin_lock(&sb->flush_flag_lock);
+ if ((!sb->flush_state && sb->last_flusher == pid) ||
+ sb->fast_flush_ns > disk->avg_flush_time_ns) {
+ sb->last_flusher = pid;
+ spin_unlock(&sb->flush_flag_lock);
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
+ NULL);
+ return 0;
+ }
+again:
+ switch (sb->flush_state) {
+ case EXT4_FLUSH_RUNNING:
+ spin_unlock(&sb->flush_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->flush_finish);
+ spin_lock(&sb->flush_flag_lock);
+ goto again;
+ case EXT4_FLUSH_WAITING:
+ spin_unlock(&sb->flush_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->flush_finish);
+ break;
+ case EXT4_FLUSH_IDLE:
+ sb->last_flusher = pid;
+ sb->flush_state = EXT4_FLUSH_WAITING;
+ INIT_COMPLETION(sb->flush_finish);
+ spin_unlock(&sb->flush_flag_lock);
+
+ expires = ktime_add_ns(ktime_get(), disk->avg_flush_time_ns);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+ spin_lock(&sb->flush_flag_lock);
+ sb->flush_state = EXT4_FLUSH_RUNNING;
+ spin_unlock(&sb->flush_flag_lock);
+
+ ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
+
+ complete_all(&sb->flush_finish);
+
+ spin_lock(&sb->flush_flag_lock);
+ sb->flush_state = EXT4_FLUSH_IDLE;
+ spin_unlock(&sb->flush_flag_lock);
+ break;
+ }
+
+ return ret;
+}
+
+/*
* akpm: A new design for ext4_sync_file().
*
* This is only called from sys_fsync(), sys_fdatasync() and sys_msync().
@@ -214,6 +291,6 @@ int ext4_sync_file(struct file *file, int datasync)
NULL);
ret = jbd2_log_wait_commit(journal, commit_tid);
} else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
+ ret = ext4_sync_dataonly(inode);
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e32195d..473721a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1026,6 +1026,10 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
!(def_mount_opts & EXT4_DEFM_NODELALLOC))
seq_puts(seq, ",nodelalloc");

+ if (sbi->fast_flush_ns != DEFAULT_FAST_FLUSH)
+ seq_printf(seq, ",fast_flush_ns=%lu",
+ sbi->fast_flush_ns);
+
if (sbi->s_stripe)
seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
/*
@@ -1245,6 +1249,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard,
Opt_init_inode_table, Opt_noinit_inode_table,
+ Opt_fast_flush_ns,
};

static const match_table_t tokens = {
@@ -1318,6 +1323,7 @@ static const match_table_t tokens = {
{Opt_init_inode_table, "init_itable=%u"},
{Opt_init_inode_table, "init_itable"},
{Opt_noinit_inode_table, "noinit_itable"},
+ {Opt_fast_flush_ns, "fast_flush_ns=%d"},
{Opt_err, NULL},
};

@@ -1802,6 +1808,15 @@ set_qf_format:
case Opt_noinit_inode_table:
clear_opt(sbi->s_mount_opt, INIT_INODE_TABLE);
break;
+ case Opt_fast_flush_ns:
+ if (args[0].from) {
+ if (match_int(&args[0], &option))
+ return 0;
+ } else
+ return 0;
+
+ sbi->fast_flush_ns = option;
+ break;
default:
ext4_msg(sb, KERN_ERR,
"Unrecognized mount option \"%s\" "
@@ -3120,6 +3135,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"failed to parse options in superblock: %s",
sbi->s_es->s_mount_opts);
}
+
+ EXT4_SB(sb)->fast_flush_ns = DEFAULT_FAST_FLUSH;
+
if (!parse_options((char *) data, sb, &journal_devnum,
&journal_ioprio, NULL, 0))
goto failed_mount;
@@ -3617,6 +3635,11 @@ no_journal:
if (es->s_error_count)
mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5 minutes */

+ EXT4_SB(sb)->flush_state = EXT4_FLUSH_IDLE;
+ spin_lock_init(&EXT4_SB(sb)->flush_flag_lock);
+ init_completion(&EXT4_SB(sb)->flush_finish);
+ EXT4_SB(sb)->last_flusher = 0;
+
kfree(orig_data);
return 0;

2010-11-29 22:06:42

by djwong

[permalink] [raw]
Subject: [PATCH 2/4] md: Compute average flush time from component devices

For md devices which are composed of other block devices, a flush is spread out
to those other block devices. Therefore, the average flush time can be
computed as the average flush time of whichever device flushes most slowly.

Signed-off-by: Darrick J. Wong <[email protected]>
---
drivers/md/md.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 43243a4..af25c96 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -357,6 +357,28 @@ EXPORT_SYMBOL(mddev_congested);
* Generic flush handling for md
*/

+static void measure_flushes(mddev_t *mddev)
+{
+ mdk_rdev_t *rdev;
+ u64 max = 0, samples = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(Faulty, &rdev->flags)) {
+ if (rdev->bdev->bd_disk->avg_flush_time_ns <= max)
+ continue;
+ max = rdev->bdev->bd_disk->avg_flush_time_ns;
+ samples = rdev->bdev->bd_disk->flush_samples;
+ }
+ rcu_read_unlock();
+
+ spin_lock(&mddev->gendisk->flush_time_lock);
+ mddev->gendisk->avg_flush_time_ns = max;
+ mddev->gendisk->flush_samples = samples;
+ spin_unlock(&mddev->gendisk->flush_time_lock);
+}
+
static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
@@ -365,6 +387,7 @@ static void md_end_flush(struct bio *bio, int err)
rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
+ measure_flushes(mddev);
/* The pre-request flush has finished */
queue_work(md_wq, &mddev->flush_work);
}

2010-11-29 23:47:18

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On 11/29/2010 05:05 PM, Darrick J. Wong wrote:
> On certain types of hardware, issuing a write cache flush takes a considerable
> amount of time. Typically, these are simple storage systems with write cache
> enabled and no battery to save that cache after a power failure. When we
> encounter a system with many I/O threads that write data and then call fsync
> after more transactions accumulate, ext4_sync_file performs a data-only flush,
> the performance of which is suboptimal because each of those threads issues its
> own flush command to the drive instead of trying to coordinate the flush,
> thereby wasting execution time.
>
> Instead of each fsync call initiating its own flush, there's now a flag to
> indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> collect other fsync threads, or (2) we're actually in-progress on a flush.
>
> So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> shifts from 0->1 and the thread delays for a short time to see if there are any
> other threads that are close behind in ext4_sync_file. After that wait, the
> state transitions to 2 and the flush is issued. Once that's done, the state
> goes back to 0 and a completion is signalled.
>
> Those close-behind threads see the flag is already 1, and go to sleep until the
> completion is signalled. Instead of issuing a flush themselves, they simply
> wait for that first thread to do it for them. If they see that the flag is 2,
> they wait for the current flush to finish, and start over.
>
> However, there are a couple of exceptions to this rule. First, there exist
> high-end storage arrays with battery-backed write caches for which flush
> commands take very little time (< 2ms); on these systems, performing the
> coordination actually lowers performance. Given the earlier patch to the block
> layer to report low-level device flush times, we can detect this situation and
> have all threads issue flushes without coordinating, as we did before. The
> second case is when there's a single thread issuing flushes, in which case it
> can skip the coordination.
>
> This author of this patch is aware that jbd2 has a similar flush coordination
> scheme for journal commits. An earlier version of this patch simply created a
> new empty journal transaction and committed it, but that approach was shown to
> increase the amount of write traffic heading towards the disk, which in turn
> lowered performance considerably, especially in the case where directio was in
> use. Therefore, this patch adds the coordination code directly to ext4.

Hi Darrick,

Just curious why we would need to have batching in both places? Doesn't your
patch set make the jbd2 transaction batching redundant?

I noticed that the patches have a default delay and a mount option to override
that default. The jbd2 code today tries to measure the average time needed in a
transaction and automatically tune itself. Can't we do something similar with
your patch set? (I hate to see yet another mount option added!)

Regards,

Ric

2010-11-30 00:19:47

by djwong

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Mon, Nov 29, 2010 at 06:48:25PM -0500, Ric Wheeler wrote:
> On 11/29/2010 05:05 PM, Darrick J. Wong wrote:
>> On certain types of hardware, issuing a write cache flush takes a considerable
>> amount of time. Typically, these are simple storage systems with write cache
<snip>
>> lowered performance considerably, especially in the case where directio was in
>> use. Therefore, this patch adds the coordination code directly to ext4.
>
> Hi Darrick,
>
> Just curious why we would need to have batching in both places? Doesn't
> your patch set make the jbd2 transaction batching redundant?

The code path that I'm changing is only executed when ext4_sync_file determines
that the flush can't go through the journal, i.e. whenever the previous
sequence of data writes hasn't resulted in any metadata updates, or if the
transaction that went with the previous writes has already been committed.

> I noticed that the patches have a default delay and a mount option to
> override that default. The jbd2 code today tries to measure the average
> time needed in a transaction and automatically tune itself. Can't we do
> something similar with your patch set? (I hate to see yet another mount
> option added!)

The mount option is no longer the delay time, as it was in previous patches.
In the (unreleased) v5 patch, the code automatically tuned the delay based on
the average flush time. However, we then observed very low flush times (< 2ms)
and about a 6% regression on our arrays with battery-backed write cache, so the
auto-tune code was then adapted in v6 to skip the coordination if the average
flush time falls below that threshold, as it does on our arrays.

Therefore, the new mount option exists to override the default threshold.

--D

2010-11-30 00:39:25

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong" <[email protected]>
wrote:

> On certain types of hardware, issuing a write cache flush takes a considerable
> amount of time. Typically, these are simple storage systems with write cache
> enabled and no battery to save that cache after a power failure. When we
> encounter a system with many I/O threads that write data and then call fsync
> after more transactions accumulate, ext4_sync_file performs a data-only flush,
> the performance of which is suboptimal because each of those threads issues its
> own flush command to the drive instead of trying to coordinate the flush,
> thereby wasting execution time.
>
> Instead of each fsync call initiating its own flush, there's now a flag to
> indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> collect other fsync threads, or (2) we're actually in-progress on a flush.
>
> So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> shifts from 0->1 and the thread delays for a short time to see if there are any
> other threads that are close behind in ext4_sync_file. After that wait, the
> state transitions to 2 and the flush is issued. Once that's done, the state
> goes back to 0 and a completion is signalled.

I haven't seen any of the preceding discussion do I might be missing
something important, but this seems needlessly complex and intrusive.
In particular, I don't like adding code to md to propagate these timings up
to the fs, and I don't the arbitrary '2ms' number.

Would it not be sufficient to simply gather flushes while a flush is pending.
i.e
- if no flush is pending, set the 'flush pending' flag, submit a flush,
then clear the flag.
- if a flush is pending, then wait for it to complete, and then submit a
single flush on behalf of all pending flushes.

That way when flush is fast, you do a flush every time, and when it is slow
you gather multiple flushes together.
I think it would issues a few more flushes than your scheme, but it would be
a much neater solution. Have you tried that and found it to be insufficient?

Thanks,
NeilBrown


2010-11-30 00:47:17

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On 11/29/2010 07:39 PM, Neil Brown wrote:
> On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong"<[email protected]>
> wrote:
>
>> On certain types of hardware, issuing a write cache flush takes a considerable
>> amount of time. Typically, these are simple storage systems with write cache
>> enabled and no battery to save that cache after a power failure. When we
>> encounter a system with many I/O threads that write data and then call fsync
>> after more transactions accumulate, ext4_sync_file performs a data-only flush,
>> the performance of which is suboptimal because each of those threads issues its
>> own flush command to the drive instead of trying to coordinate the flush,
>> thereby wasting execution time.
>>
>> Instead of each fsync call initiating its own flush, there's now a flag to
>> indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
>> collect other fsync threads, or (2) we're actually in-progress on a flush.
>>
>> So, if someone calls ext4_sync_file and no flushes are in progress, the flag
>> shifts from 0->1 and the thread delays for a short time to see if there are any
>> other threads that are close behind in ext4_sync_file. After that wait, the
>> state transitions to 2 and the flush is issued. Once that's done, the state
>> goes back to 0 and a completion is signalled.
> I haven't seen any of the preceding discussion do I might be missing
> something important, but this seems needlessly complex and intrusive.
> In particular, I don't like adding code to md to propagate these timings up
> to the fs, and I don't the arbitrary '2ms' number.
>
> Would it not be sufficient to simply gather flushes while a flush is pending.
> i.e
> - if no flush is pending, set the 'flush pending' flag, submit a flush,
> then clear the flag.
> - if a flush is pending, then wait for it to complete, and then submit a
> single flush on behalf of all pending flushes.
>
> That way when flush is fast, you do a flush every time, and when it is slow
> you gather multiple flushes together.
> I think it would issues a few more flushes than your scheme, but it would be
> a much neater solution. Have you tried that and found it to be insufficient?
>
> Thanks,
> NeilBrown
>

The problem with that is that you can introduce a wait for the next flush longer
than it would take to complete the flush. Having the wait adjust itself
according to the speed of the device is much better I think....

Ric

2010-11-30 01:26:57

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Mon, 29 Nov 2010 19:48:52 -0500 Ric Wheeler <[email protected]> wrote:

> On 11/29/2010 07:39 PM, Neil Brown wrote:
> > On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong"<[email protected]>
> > wrote:
> >
> >> On certain types of hardware, issuing a write cache flush takes a considerable
> >> amount of time. Typically, these are simple storage systems with write cache
> >> enabled and no battery to save that cache after a power failure. When we
> >> encounter a system with many I/O threads that write data and then call fsync
> >> after more transactions accumulate, ext4_sync_file performs a data-only flush,
> >> the performance of which is suboptimal because each of those threads issues its
> >> own flush command to the drive instead of trying to coordinate the flush,
> >> thereby wasting execution time.
> >>
> >> Instead of each fsync call initiating its own flush, there's now a flag to
> >> indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> >> collect other fsync threads, or (2) we're actually in-progress on a flush.
> >>
> >> So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> >> shifts from 0->1 and the thread delays for a short time to see if there are any
> >> other threads that are close behind in ext4_sync_file. After that wait, the
> >> state transitions to 2 and the flush is issued. Once that's done, the state
> >> goes back to 0 and a completion is signalled.
> > I haven't seen any of the preceding discussion do I might be missing
> > something important, but this seems needlessly complex and intrusive.
> > In particular, I don't like adding code to md to propagate these timings up
> > to the fs, and I don't the arbitrary '2ms' number.
> >
> > Would it not be sufficient to simply gather flushes while a flush is pending.
> > i.e
> > - if no flush is pending, set the 'flush pending' flag, submit a flush,
> > then clear the flag.
> > - if a flush is pending, then wait for it to complete, and then submit a
> > single flush on behalf of all pending flushes.
> >
> > That way when flush is fast, you do a flush every time, and when it is slow
> > you gather multiple flushes together.
> > I think it would issues a few more flushes than your scheme, but it would be
> > a much neater solution. Have you tried that and found it to be insufficient?
> >
> > Thanks,
> > NeilBrown
> >
>
> The problem with that is that you can introduce a wait for the next flush longer
> than it would take to complete the flush. Having the wait adjust itself
> according to the speed of the device is much better I think....
>

Hi Ric,

You certainly can introduce a wait longer than the flush would take, but
the proposed code does that too.

The proposed code waits "average flush time", then submits a flush for
all threads that are waiting.
My suggestion submits a flush (Which on average takes 'average flush time')
and then submits a flush for all threads that started waiting during that
time.

So we do get two flushes rather than one, but also the flush starts
straight away, so will presumably finish sooner.

I do have the wait 'adjust itself according to the speed of the device' by
making flushes wait for one flush to complete.


I'm guessing that the situation where this is an issue is when you have a
nearly constant stream of flush requests - is that right?

In that case:
- the current code would submit lots of over-lapping flush requests,
reducing the opportunity for optimising multiple requests in the
device,
- my proposal would submit a steady sequence of flushes with minimal
gaps
- the code from Darrick would submit a steady sequence of flush requests
which would be separated by pauses of 'average flush time'.
This would increase latency, but might increase throughput too, I
don't know.

So it seems to me that the performance differences between my suggesting
and Darrick's proposal are not obvious. So some testing would be
interesting.

I was going to suggest changes to Darrick's code to demonstrate my idea, but
I think that code is actually wrong, so it wouldn't be a good base to start.

In particular, the usage of a continuation seems racy.
As soon as one thread sets EXT4_FLUSH_IDLE, another thread could call
INIT_COMPLETION, before some other threads has had a chance to wake up and
test the completion status in wait_for_completion.
The effect is that the other thread would wait for an extra time+flush which
it shouldn't have to. So it isn't a serious race, but it looks wrong.

NeilBrown

2010-11-30 05:21:44

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 3/4] dm: Compute average flush time from component devices

On Mon, Nov 29 2010 at 5:05pm -0500,
Darrick J. Wong <[email protected]> wrote:

> For dm devices which are composed of other block devices, a flush is mapped out
> to those other block devices. Therefore, the average flush time can be
> computed as the average flush time of whichever device flushes most slowly.

I share Neil's concern about having to track such fine grained
additional state in order to make the FS behave somewhat better. What
are the _real_ fsync-happy workloads which warrant this optimization?

That concern aside, my comments on your proposed DM changes are inlined below.

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7cb1352..62aeeb9 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -846,12 +846,38 @@ static void start_queue(struct request_queue *q)
> spin_unlock_irqrestore(q->queue_lock, flags);
> }
>
> +static void measure_flushes(struct mapped_device *md)
> +{
> + struct dm_table *t;
> + struct dm_dev_internal *dd;
> + struct list_head *devices;
> + u64 max = 0, samples = 0;
> +
> + t = dm_get_live_table(md);
> + devices = dm_table_get_devices(t);
> + list_for_each_entry(dd, devices, list) {
> + if (dd->dm_dev.bdev->bd_disk->avg_flush_time_ns <= max)
> + continue;
> + max = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
> + samples = dd->dm_dev.bdev->bd_disk->flush_samples;
> + }
> + dm_table_put(t);
> +
> + spin_lock(&md->disk->flush_time_lock);
> + md->disk->avg_flush_time_ns = max;
> + md->disk->flush_samples = samples;
> + spin_unlock(&md->disk->flush_time_lock);
> +}
> +

You're checking all devices in a table rather than all devices that will
receive a flush. The devices that will receive a flush is left for each
target to determine (target exposes num_flush_requests). I'd prefer to
see a more controlled .iterate_devices() based iteration of devices in
each target.

dm-table.c:dm_calculate_queue_limits() shows how iterate_devices can be
used to combine device specific data using a common callback and a data
pointer -- for that data pointer we'd need a local temporary structure
with your 'max' and 'samples' members.

> static void dm_done(struct request *clone, int error, bool mapped)
> {
> int r = error;
> struct dm_rq_target_io *tio = clone->end_io_data;
> dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>
> + if (clone->cmd_flags & REQ_FLUSH)
> + measure_flushes(tio->md);
> +
> if (mapped && rq_end_io)
> r = rq_end_io(tio->ti, clone, error, &tio->info);
>
> @@ -2310,6 +2336,8 @@ static void dm_wq_work(struct work_struct *work)
> if (dm_request_based(md))
> generic_make_request(c);
> else
> + if (c->bi_rw & REQ_FLUSH)
> + measure_flushes(md);
> __split_and_process_bio(md, c);
>
> down_read(&md->io_lock);
>

You're missing important curly braces for the else in your dm_wq_work()
change...

But the bio-based call to measure_flushes() (dm_wq_work's call) should
be pushed into __split_and_process_bio() -- and maybe measure_flushes()
could grow a 'struct dm_table *table' argument that, if not NULL, avoids
getting the reference that __split_and_process_bio() already has on the
live table.

Mike

2010-11-30 13:45:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

Hello,

On 11/30/2010 01:39 AM, Neil Brown wrote:
> I haven't seen any of the preceding discussion do I might be missing
> something important, but this seems needlessly complex and intrusive.
> In particular, I don't like adding code to md to propagate these timings up
> to the fs, and I don't the arbitrary '2ms' number.
>
> Would it not be sufficient to simply gather flushes while a flush is pending.
> i.e
> - if no flush is pending, set the 'flush pending' flag, submit a flush,
> then clear the flag.
> - if a flush is pending, then wait for it to complete, and then submit a
> single flush on behalf of all pending flushes.

Heh, I was about to suggest exactly the same thing. Unless the delay
is gonna be multiple times longer than avg flush time, I don't think
the difference between the above scheme and the one w/ preemptive
delay would be anything significant especially now that the cost of
flush is much lower. Also, as Neil pointed out in another message,
the above scheme will result in lower latency for flushes issued while
no flush is in progress.

IMO, this kind of optimization is gonna make noticeable difference
only when there are a lot of simulatenous fsyncs, in which case the
above would behave in mostly identical way with the more elaborate
timer based one anyway.

Thanks.

--
tejun

2010-11-30 13:56:31

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On 11/30/2010 08:45 AM, Tejun Heo wrote:
> Hello,
>
> On 11/30/2010 01:39 AM, Neil Brown wrote:
>> I haven't seen any of the preceding discussion do I might be missing
>> something important, but this seems needlessly complex and intrusive.
>> In particular, I don't like adding code to md to propagate these timings up
>> to the fs, and I don't the arbitrary '2ms' number.
>>
>> Would it not be sufficient to simply gather flushes while a flush is pending.
>> i.e
>> - if no flush is pending, set the 'flush pending' flag, submit a flush,
>> then clear the flag.
>> - if a flush is pending, then wait for it to complete, and then submit a
>> single flush on behalf of all pending flushes.
> Heh, I was about to suggest exactly the same thing. Unless the delay
> is gonna be multiple times longer than avg flush time, I don't think
> the difference between the above scheme and the one w/ preemptive
> delay would be anything significant especially now that the cost of
> flush is much lower. Also, as Neil pointed out in another message,
> the above scheme will result in lower latency for flushes issued while
> no flush is in progress.
>
> IMO, this kind of optimization is gonna make noticeable difference
> only when there are a lot of simulatenous fsyncs, in which case the
> above would behave in mostly identical way with the more elaborate
> timer based one anyway.
>
> Thanks.
>

When we played with this in ext3/4, it was important to not wait when doing
single threaded fsync's (a pretty common case) since that would just make them
slower.

Also, the wait time for multi-threaded fsync's should be capped at some fraction
of the time to complete a flush. For example, we had ATA_CACHE_FLUSH_EXT
commands that took say 16ms or so to flush and waited one jiffie (4ms) and that
worked well. It tanked when we used that fixed waiting time for a high speed
device that could execute a flush in say 1ms (meaning we waited 4 times as long
as it would have taken to just submit the fsync().

I am still not clear that the scheme that you and Neil are proposing would
really batch up enough flushes to help though since you effectively do not wait.

The workload that we used years back was single threaded fs_mark (small files),
2 threads, 4 threads, 8 threads, 16 threads.

Single threaded should show no slow down with various schemes showing
multi-threaded writes grow with the number threads to some point....

Ric

2010-11-30 16:42:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

FYI, here's an updated version of my patch to not run pure flushes
that also works on SCSI/ATA and not just virtio. I suspect the patch
alone might not be enough, but together with a variant of Neil's
suggestion might do the trick, with my patch taking care of the
highend-devices and Neil's scheme of taking care of stupid SATA disks.


Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c 2010-11-30 17:27:33.108254088 +0100
+++ linux-2.6/block/blk-flush.c 2010-11-30 17:27:38.790004333 +0100
@@ -143,6 +143,17 @@ struct request *blk_do_flush(struct requ
unsigned skip = 0;

/*
+ * Just issue pure flushes directly.
+ */
+ if (!blk_rq_sectors(rq)) {
+ if (!do_preflush) {
+ __blk_end_request_all(rq, 0);
+ return NULL;
+ }
+ return rq;
+ }
+
+ /*
* Special case. If there's data but flush is not necessary,
* the request can be issued directly.
*
Index: linux-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_lib.c 2010-11-30 17:27:33.120254298 +0100
+++ linux-2.6/drivers/scsi/scsi_lib.c 2010-11-30 17:27:38.791003634 +0100
@@ -1060,7 +1060,7 @@ int scsi_setup_blk_pc_cmnd(struct scsi_d
* that does not transfer data, in which case they may optionally
* submit a request without an attached bio.
*/
- if (req->bio) {
+ if (req->bio && !(req->cmd_flags & REQ_FLUSH)) {
int ret;

BUG_ON(!req->nr_phys_segments);

2010-11-30 16:43:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Tue, Nov 30, 2010 at 11:39:06AM +1100, Neil Brown wrote:
> Would it not be sufficient to simply gather flushes while a flush is pending.
> i.e
> - if no flush is pending, set the 'flush pending' flag, submit a flush,
> then clear the flag.
> - if a flush is pending, then wait for it to complete, and then submit a
> single flush on behalf of all pending flushes.
>
> That way when flush is fast, you do a flush every time, and when it is slow
> you gather multiple flushes together.

We can even optimize the second flush away if no other I/O has
completed since the first flush has started. That will always be the
case for SATA devices as the cache flush command is not queued.

2010-11-30 23:31:32

by djwong

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Tue, Nov 30, 2010 at 11:39:06AM +1100, Neil Brown wrote:
> On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong" <[email protected]>
> wrote:
>
> > On certain types of hardware, issuing a write cache flush takes a considerable
> > amount of time. Typically, these are simple storage systems with write cache
> > enabled and no battery to save that cache after a power failure. When we
> > encounter a system with many I/O threads that write data and then call fsync
> > after more transactions accumulate, ext4_sync_file performs a data-only flush,
> > the performance of which is suboptimal because each of those threads issues its
> > own flush command to the drive instead of trying to coordinate the flush,
> > thereby wasting execution time.
> >
> > Instead of each fsync call initiating its own flush, there's now a flag to
> > indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> > collect other fsync threads, or (2) we're actually in-progress on a flush.
> >
> > So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> > shifts from 0->1 and the thread delays for a short time to see if there are any
> > other threads that are close behind in ext4_sync_file. After that wait, the
> > state transitions to 2 and the flush is issued. Once that's done, the state
> > goes back to 0 and a completion is signalled.
>
> I haven't seen any of the preceding discussion do I might be missing
> something important, but this seems needlessly complex and intrusive.
> In particular, I don't like adding code to md to propagate these timings up
> to the fs, and I don't the arbitrary '2ms' number.
>
> Would it not be sufficient to simply gather flushes while a flush is pending.
> i.e
> - if no flush is pending, set the 'flush pending' flag, submit a flush,
> then clear the flag.
> - if a flush is pending, then wait for it to complete, and then submit a
> single flush on behalf of all pending flushes.
>
> That way when flush is fast, you do a flush every time, and when it is slow
> you gather multiple flushes together.
> I think it would issues a few more flushes than your scheme, but it would be
> a much neater solution. Have you tried that and found it to be insufficient?

Some time ago I actually did test the patchset with the schedule_hrtimeout
removed, which I think is fairly close to what you've suggested. As I recall,
it did help a bit, though not as much as also instituting the wait to limit the
% of disk execution time spent on flushes. That said, I think you might be
right about the completion races, so I'll try to code up your suggestion to
see what happens with a newer kernel.

--D
>
> Thanks,
> NeilBrown
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-11-30 23:32:58

by djwong

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Tue, Nov 30, 2010 at 12:26:37PM +1100, Neil Brown wrote:
> On Mon, 29 Nov 2010 19:48:52 -0500 Ric Wheeler <[email protected]> wrote:
>
> > On 11/29/2010 07:39 PM, Neil Brown wrote:
> > > On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong"<[email protected]>
> > > wrote:
> > >
> > >> On certain types of hardware, issuing a write cache flush takes a considerable
> > >> amount of time. Typically, these are simple storage systems with write cache
> > >> enabled and no battery to save that cache after a power failure. When we
> > >> encounter a system with many I/O threads that write data and then call fsync
> > >> after more transactions accumulate, ext4_sync_file performs a data-only flush,
> > >> the performance of which is suboptimal because each of those threads issues its
> > >> own flush command to the drive instead of trying to coordinate the flush,
> > >> thereby wasting execution time.
> > >>
> > >> Instead of each fsync call initiating its own flush, there's now a flag to
> > >> indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> > >> collect other fsync threads, or (2) we're actually in-progress on a flush.
> > >>
> > >> So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> > >> shifts from 0->1 and the thread delays for a short time to see if there are any
> > >> other threads that are close behind in ext4_sync_file. After that wait, the
> > >> state transitions to 2 and the flush is issued. Once that's done, the state
> > >> goes back to 0 and a completion is signalled.
> > > I haven't seen any of the preceding discussion do I might be missing
> > > something important, but this seems needlessly complex and intrusive.
> > > In particular, I don't like adding code to md to propagate these timings up
> > > to the fs, and I don't the arbitrary '2ms' number.
> > >
> > > Would it not be sufficient to simply gather flushes while a flush is pending.
> > > i.e
> > > - if no flush is pending, set the 'flush pending' flag, submit a flush,
> > > then clear the flag.
> > > - if a flush is pending, then wait for it to complete, and then submit a
> > > single flush on behalf of all pending flushes.
> > >
> > > That way when flush is fast, you do a flush every time, and when it is slow
> > > you gather multiple flushes together.
> > > I think it would issues a few more flushes than your scheme, but it would be
> > > a much neater solution. Have you tried that and found it to be insufficient?
> > >
> > > Thanks,
> > > NeilBrown
> > >
> >
> > The problem with that is that you can introduce a wait for the next flush longer
> > than it would take to complete the flush. Having the wait adjust itself
> > according to the speed of the device is much better I think....
> >
>
> Hi Ric,
>
> You certainly can introduce a wait longer than the flush would take, but
> the proposed code does that too.
>
> The proposed code waits "average flush time", then submits a flush for
> all threads that are waiting.
> My suggestion submits a flush (Which on average takes 'average flush time')
> and then submits a flush for all threads that started waiting during that
> time.
>
> So we do get two flushes rather than one, but also the flush starts
> straight away, so will presumably finish sooner.
>
> I do have the wait 'adjust itself according to the speed of the device' by
> making flushes wait for one flush to complete.
>
>
> I'm guessing that the situation where this is an issue is when you have a
> nearly constant stream of flush requests - is that right?

Yes.

--D

> In that case:
> - the current code would submit lots of over-lapping flush requests,
> reducing the opportunity for optimising multiple requests in the
> device,
> - my proposal would submit a steady sequence of flushes with minimal
> gaps
> - the code from Darrick would submit a steady sequence of flush requests
> which would be separated by pauses of 'average flush time'.
> This would increase latency, but might increase throughput too, I
> don't know.
>
> So it seems to me that the performance differences between my suggesting
> and Darrick's proposal are not obvious. So some testing would be
> interesting.
>
> I was going to suggest changes to Darrick's code to demonstrate my idea, but
> I think that code is actually wrong, so it wouldn't be a good base to start.
>
> In particular, the usage of a continuation seems racy.
> As soon as one thread sets EXT4_FLUSH_IDLE, another thread could call
> INIT_COMPLETION, before some other threads has had a chance to wake up and
> test the completion status in wait_for_completion.
> The effect is that the other thread would wait for an extra time+flush which
> it shouldn't have to. So it isn't a serious race, but it looks wrong.
>
> NeilBrown
>

2010-12-01 00:14:51

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync

On Mon, 2010-11-29 at 18:48 -0500, Ric Wheeler wrote:
> On 11/29/2010 05:05 PM, Darrick J. Wong wrote:
> > On certain types of hardware, issuing a write cache flush takes a considerable
> > amount of time. Typically, these are simple storage systems with write cache
> > enabled and no battery to save that cache after a power failure. When we
> > encounter a system with many I/O threads that write data and then call fsync
> > after more transactions accumulate, ext4_sync_file performs a data-only flush,
> > the performance of which is suboptimal because each of those threads issues its
> > own flush command to the drive instead of trying to coordinate the flush,
> > thereby wasting execution time.
> >
> > Instead of each fsync call initiating its own flush, there's now a flag to
> > indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
> > collect other fsync threads, or (2) we're actually in-progress on a flush.
> >
> > So, if someone calls ext4_sync_file and no flushes are in progress, the flag
> > shifts from 0->1 and the thread delays for a short time to see if there are any
> > other threads that are close behind in ext4_sync_file. After that wait, the
> > state transitions to 2 and the flush is issued. Once that's done, the state
> > goes back to 0 and a completion is signalled.
> >
> > Those close-behind threads see the flag is already 1, and go to sleep until the
> > completion is signalled. Instead of issuing a flush themselves, they simply
> > wait for that first thread to do it for them. If they see that the flag is 2,
> > they wait for the current flush to finish, and start over.
> >
> > However, there are a couple of exceptions to this rule. First, there exist
> > high-end storage arrays with battery-backed write caches for which flush
> > commands take very little time (< 2ms); on these systems, performing the
> > coordination actually lowers performance. Given the earlier patch to the block
> > layer to report low-level device flush times, we can detect this situation and
> > have all threads issue flushes without coordinating, as we did before. The
> > second case is when there's a single thread issuing flushes, in which case it
> > can skip the coordination.
> >
> > This author of this patch is aware that jbd2 has a similar flush coordination
> > scheme for journal commits. An earlier version of this patch simply created a
> > new empty journal transaction and committed it, but that approach was shown to
> > increase the amount of write traffic heading towards the disk, which in turn
> > lowered performance considerably, especially in the case where directio was in
> > use. Therefore, this patch adds the coordination code directly to ext4.
>
> Hi Darrick,
>
> Just curious why we would need to have batching in both places? Doesn't your
> patch set make the jbd2 transaction batching redundant?
>

We hoped JBD2 could take care of the batching too. But ftrace shows
there are a fair amount of barriers (440 barriers/second, if I remember
right) sending from ext4_sync_file(), but not coming from jbd2 side. :(

> I noticed that the patches have a default delay and a mount option to override
> that default. The jbd2 code today tries to measure the average time needed in a
> transaction and automatically tune itself. Can't we do something similar with
> your patch set? (I hate to see yet another mount option added!)
>

I don't like a new mount option too. Darrick's new mount option is for
the threshold to turn on/off batching. Probably could make it a tunables
instead of a mount option.

Regards,
Mingming
> Regards,
>
> Ric
>
>

2010-12-02 09:50:16

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 1/4] block: Measure flush round-trip times and report average value

On Mon, 29 Nov 2010, Darrick J. Wong wrote:

> This patch adds to the block layer the ability to intercept flush requests for
> the purpose of measuring the round-trip times of the write-cache flush command
> on whatever hardware sits underneath the block layer. The eventual intent of
> this patch is for higher-level clients (filesystems) to be able to measure
> flush times to tune their use of them.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> block/blk-core.c | 13 +++++++++++++
> block/genhd.c | 39 +++++++++++++++++++++++++++++++++++++++
> fs/bio.c | 11 +++++++++++
> include/linux/blk_types.h | 2 ++
> include/linux/blkdev.h | 2 ++
> include/linux/genhd.h | 23 +++++++++++++++++++++++
> 6 files changed, 90 insertions(+), 0 deletions(-)
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..233573a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1540,6 +1540,9 @@ static inline void __generic_make_request(struct bio *bio)
>
> trace_block_bio_queue(q, bio);
>
> + if (bio->bi_rw & REQ_FLUSH)
> + bio->flush_start_time_ns = ktime_to_ns(ktime_get());
> +
> ret = q->make_request_fn(q, bio);
> } while (ret);
>
> @@ -1954,6 +1957,9 @@ void blk_start_request(struct request *req)
> req->next_rq->resid_len = blk_rq_bytes(req->next_rq);
>
> blk_add_timer(req);
> +
> + if (req->cmd_flags & REQ_FLUSH)
> + req->flush_start_time_ns = ktime_to_ns(ktime_get());
> }
> EXPORT_SYMBOL(blk_start_request);
>
> @@ -2182,6 +2188,13 @@ EXPORT_SYMBOL_GPL(blk_unprep_request);
> */
> static void blk_finish_request(struct request *req, int error)
> {
> + if (!error && req->cmd_flags & REQ_FLUSH) {
> + u64 delta;
> +
> + delta = ktime_to_ns(ktime_get()) - req->flush_start_time_ns;
> + update_flush_times(req->rq_disk, delta);
> + }
> +
> if (blk_rq_tagged(req))
> blk_queue_end_tag(req->q, req);
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 5fa2b44..465bd00 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -538,6 +538,9 @@ void add_disk(struct gendisk *disk)
> */
> disk->major = MAJOR(devt);
> disk->first_minor = MINOR(devt);
> + spin_lock_init(&disk->flush_time_lock);
> + disk->avg_flush_time_ns = 0;
> + disk->flush_samples = 0;
>
> /* Register BDI before referencing it from bdev */
> bdi = &disk->queue->backing_dev_info;
> @@ -824,6 +827,37 @@ static ssize_t disk_range_show(struct device *dev,
> return sprintf(buf, "%d\n", disk->minors);
> }
>
> +static ssize_t disk_flush_samples_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + return sprintf(buf, "%llu\n", disk->flush_samples);
> +}
> +
> +static ssize_t disk_avg_flush_time_ns_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + return sprintf(buf, "%llu\n", disk->avg_flush_time_ns);
> +}
> +
> +static ssize_t disk_avg_flush_time_ns_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)

Hi,

This is not exactly helpful name for the function since it does not
store anything, but rather clear the counters. Would not be
disk_avg_flush_time_ns_clear a better name ?

Thanks!

-Lukas

> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + spin_lock(&disk->flush_time_lock);
> + disk->avg_flush_time_ns = 0;
> + disk->flush_samples = 0;
> + spin_unlock(&disk->flush_time_lock);
> +
> + return count;
> +}
> +
> +
> static ssize_t disk_ext_range_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -876,6 +910,9 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
> }
>
> static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
> +static DEVICE_ATTR(avg_flush_time_ns, S_IRUGO | S_IWUSR,
> + disk_avg_flush_time_ns_show, disk_avg_flush_time_ns_store);
> +static DEVICE_ATTR(flush_samples, S_IRUGO, disk_flush_samples_show, NULL);
> static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
> static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
> static DEVICE_ATTR(ro, S_IRUGO, disk_ro_show, NULL);
> @@ -898,6 +935,8 @@ static struct device_attribute dev_attr_fail_timeout =
>
> static struct attribute *disk_attrs[] = {
> &dev_attr_range.attr,
> + &dev_attr_avg_flush_time_ns.attr,
> + &dev_attr_flush_samples.attr,
> &dev_attr_ext_range.attr,
> &dev_attr_removable.attr,
> &dev_attr_ro.attr,
> diff --git a/fs/bio.c b/fs/bio.c
> index 4bd454f..aab5f27 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1442,11 +1442,22 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
> **/
> void bio_endio(struct bio *bio, int error)
> {
> + struct request_queue *q = NULL;
> +
> if (error)
> clear_bit(BIO_UPTODATE, &bio->bi_flags);
> else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
> error = -EIO;
>
> + if (bio->bi_bdev)
> + q = bdev_get_queue(bio->bi_bdev);
> + if (!(q && q->prep_rq_fn) && !error && bio->bi_rw & REQ_FLUSH) {
> + u64 delta;
> +
> + delta = ktime_to_ns(ktime_get()) - bio->flush_start_time_ns;
> + update_flush_times(bio->bi_bdev->bd_disk, delta);
> + }
> +
> if (bio->bi_end_io)
> bio->bi_end_io(bio, error);
> }
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 46ad519..e8c29b9 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -74,6 +74,8 @@ struct bio {
>
> bio_destructor_t *bi_destructor; /* destructor */
>
> + u64 flush_start_time_ns;
> +
> /*
> * We can inline a number of vecs at the end of the bio, to avoid
> * double allocations for a small number of bio_vecs. This member
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aae86fd..357d669 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -163,6 +163,8 @@ struct request {
>
> /* for bidi */
> struct request *next_rq;
> +
> + u64 flush_start_time_ns;
> };
>
> static inline unsigned short req_get_ioprio(struct request *req)
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7a7b9c1..7097396 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -178,8 +178,31 @@ struct gendisk {
> struct blk_integrity *integrity;
> #endif
> int node_id;
> +
> + spinlock_t flush_time_lock;
> + u64 avg_flush_time_ns;
> + u64 flush_samples;
> };
>
> +static inline void update_flush_times(struct gendisk *disk, u64 delta)
> +{
> + u64 data;
> +
> + spin_lock(&disk->flush_time_lock);
> + if (disk->flush_samples < 256)
> + disk->flush_samples++;
> + if (!disk->avg_flush_time_ns)
> + disk->avg_flush_time_ns = delta;
> + else {
> + data = disk->avg_flush_time_ns;
> + data *= 255;
> + data += delta;
> + data /= 256;
> + disk->avg_flush_time_ns = data;
> + }
> + spin_unlock(&disk->flush_time_lock);
> +}
> +
> static inline struct gendisk *part_to_disk(struct hd_struct *part)
> {
> if (likely(part)) {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>