2010-04-14 21:17:11

by Jeff Moyer

Subject: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

Hi,

The previous two postings can be found here:
http://lkml.org/lkml/2010/4/1/344
and here:
http://lkml.org/lkml/2010/4/7/325

The basic problem is that, when running iozone on smallish files (up to
8MB in size) and including fsync in the timings, deadline outperforms
CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
files. From examining the blktrace data, it appears that iozone will
issue an fsync() call, and subsequently wait until its CFQ timeslice
has expired before the journal thread can run to actually commit data to
disk.

The approach taken to solve this problem is to implement a blk_yield call,
which tells the I/O scheduler not to idle on this process' queue. The call
is made from the jbd[2] log_wait_commit function.

This patch set addresses previous concerns that the sync-noidle workload
would be starved by keeping track of the average think time for that
workload and using that to decide whether or not to yield the queue.
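In rough terms, the yield decision boils down to the check below (condensed
from the cfq_yield() added in patch 2, where st is the sync-noidle service
tree; this is a sketch, not the exact code):

    /*
     * When serving the sync-noidle workload, only yield if its mean think
     * time no longer fits in the remaining slice; otherwise keep the slice
     * so that fsync-heavy processes don't starve it.
     */
    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
        (!sample_valid(st->ttime_samples) ||
         cfqq->slice_end - jiffies > st->ttime_mean))
            return;
    /* otherwise, expire the slice and kick the queue */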

My testing showed nothing but improvements for mixed workloads, though I
wouldn't call the testing exhaustive. I'd still very much like feedback
on the approach from jbd/jbd2 developers. Finally, I will continue to do
performance analysis of the patches.

Cheers,
Jeff

[PATCH 1/4] cfq-iosched: Keep track of average think time for the sync-noidle workload.
[PATCH 2/4] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.
[PATCH 3/4] jbd: yield the device queue when waiting for commits
[PATCH 4/4] jbd2: yield the device queue when waiting for journal commits


2010-04-14 21:17:18

by Jeff Moyer

Subject: [PATCH 4/4] jbd2: yield the device queue when waiting for journal commits

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd2/journal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index c03d4dc..ce46df6 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -41,6 +41,7 @@
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/vmalloc.h>
+#include <linux/blkdev.h>

#define CREATE_TRACE_POINTS
#include <trace/events/jbd2.h>
@@ -580,6 +581,11 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.2.5


2010-04-14 21:17:04

by Jeff Moyer

Subject: [PATCH 2/4] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.

This patch implements a blk_yield function to allow a process to voluntarily
give up its I/O scheduler time slice. This is desirable for those processes
which know that they will be blocked on I/O from another process, such as
the file system journal thread. Following patches will put calls to blk_yield
into jbd and jbd2.
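
The expected calling convention is to yield the backing device's queue right
before blocking on the other process, along these lines (a sketch only, with
placeholder names bdev, waitq and condition; the jbd/jbd2 patches later in
this series do exactly this):

    /*
     * We are about to sleep until another thread (e.g. the journal
     * thread) finishes its I/O, so stop the I/O scheduler from idling
     * on our queue in the meantime.
     */
    blk_yield(bdev_get_queue(bdev));
    wait_event(waitq, condition);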

Signed-off-by: Jeff Moyer <[email protected]>
---
block/blk-core.c | 6 ++++
block/cfq-iosched.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
block/elevator.c | 8 +++++
include/linux/blkdev.h | 1 +
include/linux/elevator.h | 3 ++
5 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..3e4e98c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -323,6 +323,12 @@ void blk_unplug(struct request_queue *q)
}
EXPORT_SYMBOL(blk_unplug);

+void blk_yield(struct request_queue *q)
+{
+ elv_yield(q);
+}
+EXPORT_SYMBOL(blk_yield);
+
/**
* blk_start_queue - restart a previously stopped queue
* @q: The &struct request_queue in question
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ef59ab3..8a300ab 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -292,6 +292,7 @@ struct cfq_data {
};

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+static void cfq_yield_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
@@ -320,6 +321,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
+ CFQ_CFQQ_FLAG_yield, /* Allow another cfqq to run */
};

#define CFQ_CFQQ_FNS(name) \
@@ -349,6 +351,7 @@ CFQ_CFQQ_FNS(coop);
CFQ_CFQQ_FNS(split_coop);
CFQ_CFQQ_FNS(deep);
CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(yield);
#undef CFQ_CFQQ_FNS

#ifdef CONFIG_DEBUG_CFQ_IOSCHED
@@ -1566,6 +1569,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_wait_busy(cfqq);
+ cfq_clear_cfqq_yield(cfqq);

/*
* If this cfqq is shared between multiple processes, check to
@@ -1887,6 +1891,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
cfqq->nr_sectors += blk_rq_sectors(rq);
+
+ if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_yield_cfqq(cfqd, cfqq);
}

/*
@@ -2191,6 +2198,68 @@ keep_queue:
return cfqq;
}

+static void cfq_yield_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ __cfq_slice_expired(cfqd, cfqq, 1);
+ __blk_run_queue(cfqd->queue);
+}
+
+static void cfq_yield(struct request_queue *q)
+{
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_io_context *cic;
+ struct cfq_queue *cfqq;
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ spin_lock_irq(q->queue_lock);
+
+ /*
+ * This is primarily called to ensure that the long synchronous
+ * time slice does not prevent other I/O happening (like journal
+ * commits) while we idle waiting for it. Thus, check to see if the
+ * current cfqq is the sync cfqq for this process.
+ */
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq)
+ goto out_unlock;
+
+ if (cfqd->active_queue != cfqq)
+ goto out_unlock;
+
+ /*
+ * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
+ * are idling on the last queue in that workload, *and* the average
+ * think time is larger than the remaining slice time, go ahead
+ * and yield the queue. Otherwise, don't yield so that fsync-heavy
+ * workloads don't starve out the sync-noidle workload.
+ */
+ if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
+ (!sample_valid(cfqq->service_tree->ttime_samples) ||
+ cfqq->slice_end - jiffies > cfqq->service_tree->ttime_mean))
+ goto out_unlock;
+
+
+ cfq_log_cfqq(cfqd, cfqq, "yielding queue");
+
+ /*
+ * If there are other requests pending, just mark the queue as
+ * yielding and give up our slice after the last request is
+ * dispatched.
+ */
+ if (!RB_EMPTY_ROOT(&cfqq->sort_list)) {
+ cfq_mark_cfqq_yield(cfqq);
+ goto out_unlock;
+ }
+
+ cfq_yield_cfqq(cfqd, cfqq);
+
+out_unlock:
+ spin_unlock_irq(q->queue_lock);
+}
+
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
{
int dispatched = 0;
@@ -3911,6 +3980,7 @@ static struct elevator_type iosched_cfq = {
.elevator_deactivate_req_fn = cfq_deactivate_request,
.elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
+ .elevator_yield_fn = cfq_yield,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_set_req_fn = cfq_set_request,
diff --git a/block/elevator.c b/block/elevator.c
index 76e3702..6b16421 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -855,6 +855,14 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
}
}

+void elv_yield(struct request_queue *q)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->ops->elevator_yield_fn)
+ e->ops->elevator_yield_fn(q);
+}
+
#define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)

static ssize_t
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6690e8b..0e749e2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -833,6 +833,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
struct request *, int, rq_end_io_fn *);
extern void blk_unplug(struct request_queue *q);
+extern void blk_yield(struct request_queue *q);

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..9b4e2e9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -20,6 +20,7 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_queue_empty_fn) (struct request_queue *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
+typedef void (elevator_yield_fn) (struct request_queue *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
@@ -44,6 +45,7 @@ struct elevator_ops

elevator_queue_empty_fn *elevator_queue_empty_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
+ elevator_yield_fn *elevator_yield_fn;

elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
@@ -105,6 +107,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
extern void elv_merged_request(struct request_queue *, struct request *, int);
extern void elv_requeue_request(struct request_queue *, struct request *);
extern int elv_queue_empty(struct request_queue *);
+extern void elv_yield(struct request_queue *);
extern struct request *elv_former_request(struct request_queue *, struct request *);
extern struct request *elv_latter_request(struct request_queue *, struct request *);
extern int elv_register_queue(struct request_queue *q);
--
1.6.2.5

2010-04-14 21:17:03

by Jeff Moyer

Subject: [PATCH 1/4] cfq-iosched: Keep track of average think time for the sync-noidle workload.

This patch uses an average think time for the entirety of the sync-noidle
workload to determine whether or not to idle on said workload. This brings
it more in line with the policy for the sync queues in the sync workload.

Testing shows that this provided an overall increase in throughput for
a mixed workload on my hardware RAID array.
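
For reference, the decayed average behaves like the small userspace sketch
below (same fixed-point arithmetic as the patch, values in jiffies; the
clamping of each sample to 2 * cfq_slice_idle is omitted here):

#include <stdio.h>

int main(void)
{
        /*
         * Mirrors cfq_update_st_thinktime(): samples and totals are kept
         * scaled by 256 so that integer math retains precision, and each
         * new observation carries 1/8 of the weight.
         */
        unsigned long samples = 0, total = 0, mean;
        unsigned long ttimes[] = { 2, 2, 3, 2, 50, 2, 2, 3 };
        size_t i;

        for (i = 0; i < sizeof(ttimes) / sizeof(ttimes[0]); i++) {
                samples = (7 * samples + 256) / 8;
                total = (7 * total + 256 * ttimes[i]) / 8;
                mean = (total + 128) / samples;
                printf("ttime=%lu mean=%lu samples=%lu\n",
                       ttimes[i], mean, samples);
        }
        return 0;
}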

Signed-off-by: Jeff Moyer <[email protected]>
---
block/cfq-iosched.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 838834b..ef59ab3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -83,9 +83,14 @@ struct cfq_rb_root {
unsigned total_weight;
u64 min_vdisktime;
struct rb_node *active;
+ unsigned long last_end_request;
+ unsigned long ttime_total;
+ unsigned long ttime_samples;
+ unsigned long ttime_mean;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
- .count = 0, .min_vdisktime = 0, }
+ .count = 0, .min_vdisktime = 0, .last_end_request = 0, \
+ .ttime_total = 0, .ttime_samples = 0, .ttime_mean = 0 }

/*
* Per process-grouping structure
@@ -962,8 +967,10 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
goto done;

cfqg->weight = blkcg->weight;
- for_each_cfqg_st(cfqg, i, j, st)
+ for_each_cfqg_st(cfqg, i, j, st) {
*st = CFQ_RB_ROOT;
+ st->last_end_request = jiffies;
+ }
RB_CLEAR_NODE(&cfqg->rb_node);

/*
@@ -1795,9 +1802,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)

/*
* Otherwise, we do only if they are the last ones
- * in their service tree.
+ * in their service tree and the average think time is
+ * less than the remaining slice time.
*/
- if (service_tree->count == 1 && cfq_cfqq_sync(cfqq))
+ if (service_tree->count == 1 && cfq_cfqq_sync(cfqq) &&
+ (!sample_valid(service_tree->ttime_samples) ||
+ cfqq->slice_end - jiffies > service_tree->ttime_mean))
return 1;
cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d",
service_tree->count);
@@ -2988,6 +2998,18 @@ err:
}

static void
+cfq_update_st_thinktime(struct cfq_data *cfqd, struct cfq_rb_root *service_tree)
+{
+ unsigned long elapsed = jiffies - service_tree->last_end_request;
+ unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
+
+ service_tree->ttime_samples = (7*service_tree->ttime_samples + 256) / 8;
+ service_tree->ttime_total = (7*service_tree->ttime_total + 256*ttime)/8;
+ service_tree->ttime_mean = (service_tree->ttime_total + 128) /
+ service_tree->ttime_samples;
+}
+
+static void
cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
{
unsigned long elapsed = jiffies - cic->last_end_request;
@@ -3166,6 +3188,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->meta_pending++;

cfq_update_io_thinktime(cfqd, cic);
+ cfq_update_st_thinktime(cfqd, cfqq->service_tree);
cfq_update_io_seektime(cfqd, cfqq, rq);
cfq_update_idle_window(cfqd, cfqq, cic);

@@ -3304,7 +3327,16 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;

if (sync) {
+ struct cfq_rb_root *st;
+
RQ_CIC(rq)->last_end_request = now;
+ /*
+ * cfqq->service_tree is only filled in while on the rb tree,
+ * so we need to lookup the service tree here.
+ */
+ st = service_tree_for(cfqq->cfqg,
+ cfqq_prio(cfqq), cfqq_type(cfqq));
+ st->last_end_request = now;
if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
cfqd->last_delayed_sync = now;
}
@@ -3678,11 +3710,14 @@ static void *cfq_init_queue(struct request_queue *q)

/* Init root service tree */
cfqd->grp_service_tree = CFQ_RB_ROOT;
+ cfqd->grp_service_tree.last_end_request = jiffies;

/* Init root group */
cfqg = &cfqd->root_group;
- for_each_cfqg_st(cfqg, i, j, st)
+ for_each_cfqg_st(cfqg, i, j, st) {
*st = CFQ_RB_ROOT;
+ st->last_end_request = jiffies;
+ }
RB_CLEAR_NODE(&cfqg->rb_node);

/* Give preference to root group over other groups */
--
1.6.2.5

2010-04-14 21:17:05

by Jeff Moyer

Subject: [PATCH 3/4] jbd: yield the device queue when waiting for commits

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd/journal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index bd224ee..c1df4b2 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/blkdev.h>

#include <asm/uaccess.h>
#include <asm/page.h>
@@ -549,6 +550,11 @@ int log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.2.5

2010-04-14 21:37:50

by Vivek Goyal

Subject: Re: [PATCH 1/4] cfq-iosched: Keep track of average think time for the sync-noidle workload.

On Wed, Apr 14, 2010 at 05:17:03PM -0400, Jeff Moyer wrote:
> This patch uses an average think time for the entirety of the sync-noidle
> workload to determine whether or not to idle on said workload. This brings
> it more in line with the policy for the sync queues in the sync workload.
>
> Testing shows that this provided an overall increase in throughput for
> a mixed workload on my hardware RAID array.
>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> block/cfq-iosched.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 838834b..ef59ab3 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -83,9 +83,14 @@ struct cfq_rb_root {
> unsigned total_weight;
> u64 min_vdisktime;
> struct rb_node *active;
> + unsigned long last_end_request;
> + unsigned long ttime_total;
> + unsigned long ttime_samples;
> + unsigned long ttime_mean;
> };
> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
> - .count = 0, .min_vdisktime = 0, }
> + .count = 0, .min_vdisktime = 0, .last_end_request = 0, \
> + .ttime_total = 0, .ttime_samples = 0, .ttime_mean = 0 }
>
> /*
> * Per process-grouping structure
> @@ -962,8 +967,10 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> goto done;
>
> cfqg->weight = blkcg->weight;
> - for_each_cfqg_st(cfqg, i, j, st)
> + for_each_cfqg_st(cfqg, i, j, st) {
> *st = CFQ_RB_ROOT;
> + st->last_end_request = jiffies;
> + }
> RB_CLEAR_NODE(&cfqg->rb_node);
>
> /*
> @@ -1795,9 +1802,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>
> /*
> * Otherwise, we do only if they are the last ones
> - * in their service tree.
> + * in their service tree and the average think time is
> + * less than the remaining slice time.
> */
> - if (service_tree->count == 1 && cfq_cfqq_sync(cfqq))
> + if (service_tree->count == 1 && cfq_cfqq_sync(cfqq) &&
> + (!sample_valid(service_tree->ttime_samples) ||
> + cfqq->slice_end - jiffies > service_tree->ttime_mean))
> return 1;
> cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d",
> service_tree->count);
> @@ -2988,6 +2998,18 @@ err:
> }
>
> static void
> +cfq_update_st_thinktime(struct cfq_data *cfqd, struct cfq_rb_root *service_tree)
> +{
> + unsigned long elapsed = jiffies - service_tree->last_end_request;
> + unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
> +
> + service_tree->ttime_samples = (7*service_tree->ttime_samples + 256) / 8;
> + service_tree->ttime_total = (7*service_tree->ttime_total + 256*ttime)/8;
> + service_tree->ttime_mean = (service_tree->ttime_total + 128) /
> + service_tree->ttime_samples;
> +}
> +
> +static void
> cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
> {
> unsigned long elapsed = jiffies - cic->last_end_request;
> @@ -3166,6 +3188,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> cfqq->meta_pending++;
>
> cfq_update_io_thinktime(cfqd, cic);
> + cfq_update_st_thinktime(cfqd, cfqq->service_tree);
> cfq_update_io_seektime(cfqd, cfqq, rq);
> cfq_update_idle_window(cfqd, cfqq, cic);
>
> @@ -3304,7 +3327,16 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
>
> if (sync) {
> + struct cfq_rb_root *st;
> +
> RQ_CIC(rq)->last_end_request = now;
> + /*
> + * cfqq->service_tree is only filled in while on the rb tree,
> + * so we need to lookup the service tree here.
> + */
> + st = service_tree_for(cfqq->cfqg,
> + cfqq_prio(cfqq), cfqq_type(cfqq));
> + st->last_end_request = now;
> if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
> cfqd->last_delayed_sync = now;
> }
> @@ -3678,11 +3710,14 @@ static void *cfq_init_queue(struct request_queue *q)
>
> /* Init root service tree */
> cfqd->grp_service_tree = CFQ_RB_ROOT;
> + cfqd->grp_service_tree.last_end_request = jiffies;
>

This assignment is not required, as we never update the think time
analysis of the service tree where all the groups are hanging. So for
grp_service_tree, last_end_request can be zero and there will be no harm.

Otherwise the patch looks good to me. Can you please run some simple blkio
cgroup tests to make sure that functionality is not broken?

Acked-by: Vivek Goyal <[email protected]>

Thanks
Vivek

> /* Init root group */
> cfqg = &cfqd->root_group;
> - for_each_cfqg_st(cfqg, i, j, st)
> + for_each_cfqg_st(cfqg, i, j, st) {
> *st = CFQ_RB_ROOT;
> + st->last_end_request = jiffies;
> + }
> RB_CLEAR_NODE(&cfqg->rb_node);
>
> /* Give preference to root group over other groups */
> --
> 1.6.2.5

2010-04-14 21:46:59

by Vivek Goyal

Subject: Re: [PATCH 2/4] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.

On Wed, Apr 14, 2010 at 05:17:04PM -0400, Jeff Moyer wrote:
> This patch implements a blk_yield function to allow a process to voluntarily
> give up its I/O scheduler time slice. This is desirable for those processes
> which know that they will be blocked on I/O from another process, such as
> the file system journal thread. Following patches will put calls to blk_yield
> into jbd and jbd2.
>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> block/blk-core.c | 6 ++++
> block/cfq-iosched.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
> block/elevator.c | 8 +++++
> include/linux/blkdev.h | 1 +
> include/linux/elevator.h | 3 ++
> 5 files changed, 88 insertions(+), 0 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9fe174d..3e4e98c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -323,6 +323,12 @@ void blk_unplug(struct request_queue *q)
> }
> EXPORT_SYMBOL(blk_unplug);
>
> +void blk_yield(struct request_queue *q)
> +{
> + elv_yield(q);
> +}
> +EXPORT_SYMBOL(blk_yield);
> +
> /**
> * blk_start_queue - restart a previously stopped queue
> * @q: The &struct request_queue in question
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index ef59ab3..8a300ab 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -292,6 +292,7 @@ struct cfq_data {
> };
>
> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> +static void cfq_yield_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq);
>
> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> enum wl_prio_t prio,
> @@ -320,6 +321,7 @@ enum cfqq_state_flags {
> CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
> CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
> CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
> + CFQ_CFQQ_FLAG_yield, /* Allow another cfqq to run */
> };
>
> #define CFQ_CFQQ_FNS(name) \
> @@ -349,6 +351,7 @@ CFQ_CFQQ_FNS(coop);
> CFQ_CFQQ_FNS(split_coop);
> CFQ_CFQQ_FNS(deep);
> CFQ_CFQQ_FNS(wait_busy);
> +CFQ_CFQQ_FNS(yield);
> #undef CFQ_CFQQ_FNS
>
> #ifdef CONFIG_DEBUG_CFQ_IOSCHED
> @@ -1566,6 +1569,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>
> cfq_clear_cfqq_wait_request(cfqq);
> cfq_clear_cfqq_wait_busy(cfqq);
> + cfq_clear_cfqq_yield(cfqq);
>
> /*
> * If this cfqq is shared between multiple processes, check to
> @@ -1887,6 +1891,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>
> cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> cfqq->nr_sectors += blk_rq_sectors(rq);
> +
> + if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
> + cfq_yield_cfqq(cfqd, cfqq);

Jeff,

I am wondering if cfq_select_queue() will be a better place for yielding
the queue.

if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
goto expire;

We can avoid one unnecessary __blk_run_queue().

Apart from above minor nit, it looks good to me.

Acked-by: Vivek Goyal <[email protected]>

Thanks
Vivek

> }
>
> /*
> @@ -2191,6 +2198,68 @@ keep_queue:
> return cfqq;
> }
>
> +static void cfq_yield_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> +{
> + __cfq_slice_expired(cfqd, cfqq, 1);
> + __blk_run_queue(cfqd->queue);
> +}
> +
> +static void cfq_yield(struct request_queue *q)
> +{
> + struct cfq_data *cfqd = q->elevator->elevator_data;
> + struct cfq_io_context *cic;
> + struct cfq_queue *cfqq;
> +
> + cic = cfq_cic_lookup(cfqd, current->io_context);
> + if (!cic)
> + return;
> +
> + spin_lock_irq(q->queue_lock);
> +
> + /*
> + * This is primarily called to ensure that the long synchronous
> + * time slice does not prevent other I/O happening (like journal
> + * commits) while we idle waiting for it. Thus, check to see if the
> + * current cfqq is the sync cfqq for this process.
> + */
> + cfqq = cic_to_cfqq(cic, 1);
> + if (!cfqq)
> + goto out_unlock;
> +
> + if (cfqd->active_queue != cfqq)
> + goto out_unlock;
> +
> + /*
> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
> + * are idling on the last queue in that workload, *and* the average
> + * think time is larger than the remaining slice time, go ahead
> + * and yield the queue. Otherwise, don't yield so that fsync-heavy
> + * workloads don't starve out the sync-noidle workload.
> + */
> + if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
> + (!sample_valid(cfqq->service_tree->ttime_samples) ||
> + cfqq->slice_end - jiffies > cfqq->service_tree->ttime_mean))
> + goto out_unlock;
> +
> +
> + cfq_log_cfqq(cfqd, cfqq, "yielding queue");
> +
> + /*
> + * If there are other requests pending, just mark the queue as
> + * yielding and give up our slice after the last request is
> + * dispatched.
> + */
> + if (!RB_EMPTY_ROOT(&cfqq->sort_list)) {
> + cfq_mark_cfqq_yield(cfqq);
> + goto out_unlock;
> + }
> +
> + cfq_yield_cfqq(cfqd, cfqq);
> +
> +out_unlock:
> + spin_unlock_irq(q->queue_lock);
> +}
> +
> static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
> {
> int dispatched = 0;
> @@ -3911,6 +3980,7 @@ static struct elevator_type iosched_cfq = {
> .elevator_deactivate_req_fn = cfq_deactivate_request,
> .elevator_queue_empty_fn = cfq_queue_empty,
> .elevator_completed_req_fn = cfq_completed_request,
> + .elevator_yield_fn = cfq_yield,
> .elevator_former_req_fn = elv_rb_former_request,
> .elevator_latter_req_fn = elv_rb_latter_request,
> .elevator_set_req_fn = cfq_set_request,
> diff --git a/block/elevator.c b/block/elevator.c
> index 76e3702..6b16421 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -855,6 +855,14 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
> }
> }
>
> +void elv_yield(struct request_queue *q)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->ops->elevator_yield_fn)
> + e->ops->elevator_yield_fn(q);
> +}
> +
> #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
>
> static ssize_t
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 6690e8b..0e749e2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -833,6 +833,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
> extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
> struct request *, int, rq_end_io_fn *);
> extern void blk_unplug(struct request_queue *q);
> +extern void blk_yield(struct request_queue *q);
>
> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
> {
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 1cb3372..9b4e2e9 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -20,6 +20,7 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
> typedef int (elevator_queue_empty_fn) (struct request_queue *);
> typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
> typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
> +typedef void (elevator_yield_fn) (struct request_queue *);
> typedef int (elevator_may_queue_fn) (struct request_queue *, int);
>
> typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
> @@ -44,6 +45,7 @@ struct elevator_ops
>
> elevator_queue_empty_fn *elevator_queue_empty_fn;
> elevator_completed_req_fn *elevator_completed_req_fn;
> + elevator_yield_fn *elevator_yield_fn;
>
> elevator_request_list_fn *elevator_former_req_fn;
> elevator_request_list_fn *elevator_latter_req_fn;
> @@ -105,6 +107,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
> extern void elv_merged_request(struct request_queue *, struct request *, int);
> extern void elv_requeue_request(struct request_queue *, struct request *);
> extern int elv_queue_empty(struct request_queue *);
> +extern void elv_yield(struct request_queue *);
> extern struct request *elv_former_request(struct request_queue *, struct request *);
> extern struct request *elv_latter_request(struct request_queue *, struct request *);
> extern int elv_register_queue(struct request_queue *q);
> --
> 1.6.2.5

2010-04-14 23:06:44

by Jeff Moyer

Subject: Re: [PATCH 1/4] cfq-iosched: Keep track of average think time for the sync-noidle workload.

Vivek Goyal <[email protected]> writes:

> On Wed, Apr 14, 2010 at 05:17:03PM -0400, Jeff Moyer wrote:

>> @@ -3678,11 +3710,14 @@ static void *cfq_init_queue(struct request_queue *q)
>>
>> /* Init root service tree */
>> cfqd->grp_service_tree = CFQ_RB_ROOT;
>> + cfqd->grp_service_tree.last_end_request = jiffies;
>>
>
> This assignment is not required, as we never update the think time
> analysis of the service tree where all the groups are hanging. So for
> grp_service_tree, last_end_request can be zero and there will be no harm.

OK, thanks.

> Otherwise the patch looks good to me. Can you please run some simple blkio
> cgroup tests to make sure that functionality is not broken?

Yes, I'll add that to my list and get back to you.

Thanks for the review, Vivek.

Cheers,
Jeff

2010-04-15 10:33:24

by Jens Axboe

Subject: Re: [PATCH 2/4] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.

On Wed, Apr 14 2010, Vivek Goyal wrote:
> On Wed, Apr 14, 2010 at 05:17:04PM -0400, Jeff Moyer wrote:
> > This patch implements a blk_yield function to allow a process to voluntarily
> > give up its I/O scheduler time slice. This is desirable for those processes
> > which know that they will be blocked on I/O from another process, such as
> > the file system journal thread. Following patches will put calls to blk_yield
> > into jbd and jbd2.
> >
> > Signed-off-by: Jeff Moyer <[email protected]>
> > ---
> > block/blk-core.c | 6 ++++
> > block/cfq-iosched.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
> > block/elevator.c | 8 +++++
> > include/linux/blkdev.h | 1 +
> > include/linux/elevator.h | 3 ++
> > 5 files changed, 88 insertions(+), 0 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 9fe174d..3e4e98c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -323,6 +323,12 @@ void blk_unplug(struct request_queue *q)
> > }
> > EXPORT_SYMBOL(blk_unplug);
> >
> > +void blk_yield(struct request_queue *q)
> > +{
> > + elv_yield(q);
> > +}
> > +EXPORT_SYMBOL(blk_yield);
> > +
> > /**
> > * blk_start_queue - restart a previously stopped queue
> > * @q: The &struct request_queue in question
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index ef59ab3..8a300ab 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -292,6 +292,7 @@ struct cfq_data {
> > };
> >
> > static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> > +static void cfq_yield_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq);
> >
> > static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> > enum wl_prio_t prio,
> > @@ -320,6 +321,7 @@ enum cfqq_state_flags {
> > CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
> > CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
> > CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
> > + CFQ_CFQQ_FLAG_yield, /* Allow another cfqq to run */
> > };
> >
> > #define CFQ_CFQQ_FNS(name) \
> > @@ -349,6 +351,7 @@ CFQ_CFQQ_FNS(coop);
> > CFQ_CFQQ_FNS(split_coop);
> > CFQ_CFQQ_FNS(deep);
> > CFQ_CFQQ_FNS(wait_busy);
> > +CFQ_CFQQ_FNS(yield);
> > #undef CFQ_CFQQ_FNS
> >
> > #ifdef CONFIG_DEBUG_CFQ_IOSCHED
> > @@ -1566,6 +1569,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >
> > cfq_clear_cfqq_wait_request(cfqq);
> > cfq_clear_cfqq_wait_busy(cfqq);
> > + cfq_clear_cfqq_yield(cfqq);
> >
> > /*
> > * If this cfqq is shared between multiple processes, check to
> > @@ -1887,6 +1891,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
> >
> > cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> > cfqq->nr_sectors += blk_rq_sectors(rq);
> > +
> > + if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
> > + cfq_yield_cfqq(cfqd, cfqq);
>
> Jeff,
>
> I am wondering if cfq_select_queue() will be a better place for yielding
> the queue.
>
> if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
> goto expire;
>
> We can avoid one unnecessary __blk_run_queue().

Agree, doing it on insert is not the right place.

--
Jens Axboe


2010-04-15 10:33:39

by Jens Axboe

Subject: Re: [PATCH 4/4] jbd2: yield the device queue when waiting for journal commits

On Wed, Apr 14 2010, Jeff Moyer wrote:
> This patch gets CFQ back in line with deadline for iozone runs, especially
> those testing small files + fsync timings.
>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> fs/jbd2/journal.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index c03d4dc..ce46df6 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -41,6 +41,7 @@
> #include <linux/hash.h>
> #include <linux/log2.h>
> #include <linux/vmalloc.h>
> +#include <linux/blkdev.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/jbd2.h>
> @@ -580,6 +581,11 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> while (tid_gt(tid, journal->j_commit_sequence)) {
> jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
> tid, journal->j_commit_sequence);
> + /*
> + * Give up our I/O scheduler time slice to allow the journal
> + * thread to issue I/O.
> + */
> + blk_yield(journal->j_dev->bd_disk->queue);
> wake_up(&journal->j_wait_commit);
> spin_unlock(&journal->j_state_lock);
> wait_event(journal->j_wait_done_commit,

White space problem here.

--
Jens Axboe


2010-04-15 10:33:46

by Jens Axboe

Subject: Re: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

On Wed, Apr 14 2010, Jeff Moyer wrote:
> Hi,
>
> The previous two postings can be found here:
> http://lkml.org/lkml/2010/4/1/344
> and here:
> http://lkml.org/lkml/2010/4/7/325
>
> The basic problem is that, when running iozone on smallish files (up to
> 8MB in size) and including fsync in the timings, deadline outperforms
> CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> files. From examining the blktrace data, it appears that iozone will
> issue an fsync() call, and subsequently wait until its CFQ timeslice
> has expired before the journal thread can run to actually commit data to
> disk.
>
> The approach taken to solve this problem is to implement a blk_yield call,
> which tells the I/O scheduler not to idle on this process' queue. The call
> is made from the jbd[2] log_wait_commit function.
>
> This patch set addresses previous concerns that the sync-noidle workload
> would be starved by keeping track of the average think time for that
> workload and using that to decide whether or not to yield the queue.
>
> My testing showed nothing but improvements for mixed workloads, though I
> wouldn't call the testing exhaustive. I'd still very much like feedback
> on the approach from jbd/jbd2 developers. Finally, I will continue to do
> performance analysis of the patches.

This is starting to look better. Can you share what tests you did? I
tried reproducing with fs_mark last time and could not.

--
Jens Axboe


2010-04-15 13:05:33

by Jeff Moyer

Subject: Re: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

Jens Axboe <[email protected]> writes:

> On Wed, Apr 14 2010, Jeff Moyer wrote:
>> Hi,
>>
>> The previous two postings can be found here:
>> http://lkml.org/lkml/2010/4/1/344
>> and here:
>> http://lkml.org/lkml/2010/4/7/325
>>
>> The basic problem is that, when running iozone on smallish files (up to
>> 8MB in size) and including fsync in the timings, deadline outperforms
>> CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
>> files. From examining the blktrace data, it appears that iozone will
>> issue an fsync() call, and subsequently wait until its CFQ timeslice
>> has expired before the journal thread can run to actually commit data to
>> disk.
>>
>> The approach taken to solve this problem is to implement a blk_yield call,
>> which tells the I/O scheduler not to idle on this process' queue. The call
>> is made from the jbd[2] log_wait_commit function.
>>
>> This patch set addresses previous concerns that the sync-noidle workload
>> would be starved by keeping track of the average think time for that
>> workload and using that to decide whether or not to yield the queue.
>>
>> My testing showed nothing but improvements for mixed workloads, though I
>> wouldn't call the testing exhaustive. I'd still very much like feedback
>> on the approach from jbd/jbd2 developers. Finally, I will continue to do
>> performance analysis of the patches.
>
> This is starting to look better. Can you share what tests you did? I
> tried reproducing with fs_mark last time and could not.

Did you use the fs_mark command line I (think I) had posted? What
storage were you using?

I took Vivek's iostest and modified the mixed workload to do buffered
random reader, buffered sequential reader, and buffered writer for all
of 1, 2, 4, 8 and 16 threads each.

The initial problem was reported against iozone, which can show the
problem quite easily when run like so:
iozone -s 64 -e -f /mnt/test/iozone.0 -i 0 -+n

You can also just run iozone in auto mode, but that can take quite a
while to complete.
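
The pattern iozone exercises there is just a small write followed by an
fsync. A minimal stand-in (untested sketch; the path is only an example)
looks like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/test/fsync-test.0";
        char buf[4096];
        int fd, i;

        memset(buf, 0xaa, sizeof(buf));
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 16; i++) {      /* 16 x 4KB = 64KB */
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        perror("write");
                        return 1;
                }
        }
        if (fsync(fd)) {                /* the stall shows up here under CFQ */
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}

Timing that fsync (or watching it under blktrace) should show the wait for
the slice to expire before the journal commit is issued.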

All of my tests for this round have been against a NetApp hardware
RAID. I wanted to test against a simple sata disk as well, but have
become swamped with other issues.

I'll include all of this information in the next patch posting. Sorry
about that.

Cheers,
Jeff

2010-04-15 13:08:56

by Jens Axboe

Subject: Re: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

On Thu, Apr 15 2010, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
> > On Wed, Apr 14 2010, Jeff Moyer wrote:
> >> Hi,
> >>
> >> The previous two postings can be found here:
> >> http://lkml.org/lkml/2010/4/1/344
> >> and here:
> >> http://lkml.org/lkml/2010/4/7/325
> >>
> >> The basic problem is that, when running iozone on smallish files (up to
> >> 8MB in size) and including fsync in the timings, deadline outperforms
> >> CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> >> files. From examining the blktrace data, it appears that iozone will
> >> issue an fsync() call, and subsequently wait until its CFQ timeslice
> >> has expired before the journal thread can run to actually commit data to
> >> disk.
> >>
> >> The approach taken to solve this problem is to implement a blk_yield call,
> >> which tells the I/O scheduler not to idle on this process' queue. The call
> >> is made from the jbd[2] log_wait_commit function.
> >>
> >> This patch set addresses previous concerns that the sync-noidle workload
> >> would be starved by keeping track of the average think time for that
> >> workload and using that to decide whether or not to yield the queue.
> >>
> >> My testing showed nothing but improvements for mixed workloads, though I
> >> wouldn't call the testing exhaustive. I'd still very much like feedback
> >> on the approach from jbd/jbd2 developers. Finally, I will continue to do
> >> performance analysis of the patches.
> >
> > This is starting to look better. Can you share what tests you did? I
> > tried reproducing with fs_mark last time and could not.
>
> Did you use the fs_mark command line I (think I) had posted? What
> storage were you using?

No, I didn't see any references to example command lines. I tested on a
few single disks, rotating and SSD. I expected the single spinning disk
to show the problem to some extent at least, but there was no difference
observed with 64kb blocks.

> I took Vivek's iostest and modified the mixed workload to do buffered
> random reader, buffered sequential reader, and buffered writer for all
> of 1, 2, 4, 8 and 16 threads each.
>
> The initial problem was reported against iozone, which can show the
> problem quite easily when run like so:
> iozone -s 64 -e -f /mnt/test/iozone.0 -i 0 -+n
>
> You can also just run iozone in auto mode, but that can take quite a
> while to complete.
>
> All of my tests for this round have been against a NetApp hardware
> RAID. I wanted to test against a simple sata disk as well, but have
> become swamped with other issues.
>
> I'll include all of this information in the next patch posting. Sorry
> about that.

No problem, I'll try the above.

--
Jens Axboe


2010-04-15 13:13:57

by Jeff Moyer

Subject: Re: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

Jens Axboe <[email protected]> writes:

>> > This is starting to look better. Can you share what tests you did? I
>> > tried reproducing with fs_mark last time and could not.
>>
>> Did you use the fs_mark command line I (think I) had posted? What
>> storage were you using?
>
> No, I didn't see any references to example command lines. I tested on a
> few single disks, rotating and SSD. I expected the single spinning disk
> to show the problem to some extent at least, but there was no difference
> observed with 64kb blocks.

Boy, I'm really slipping. Try this one:

./fs_mark -S 1 -D 100 -N 1000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096

Cheers,
Jeff

2010-04-15 14:03:50

by Jens Axboe

Subject: Re: [PATCH 0/4 v3] ext3/4: enhance fsync performance when using CFQ

On Thu, Apr 15 2010, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
> >> > This is starting to look better. Can you share what tests you did? I
> >> > tried reproducing with fs_mark last time and could not.
> >>
> >> Did you use the fs_mark command line I (think I) had posted? What
> >> storage were you using?
> >
> > No, I didn't see any references to example command lines. I tested on a
> > few single disks, rotating and SSD. I expected the single spinning disk
> > to show the problem to some extent at least, but there was no difference
> > observed with 64kb blocks.
>
> Boy, I'm really slipping. Try this one:
>
> ./fs_mark -S 1 -D 100 -N 1000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096

Thanks Jeff, I'll give it a spin :-)

--
Jens Axboe


2010-04-15 15:49:27

by Jeff Moyer

Subject: Re: [PATCH 2/4] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.

Jens Axboe <[email protected]> writes:

> On Wed, Apr 14 2010, Vivek Goyal wrote:

>> > @@ -1887,6 +1891,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>> >
>> > cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
>> > cfqq->nr_sectors += blk_rq_sectors(rq);
>> > +
>> > + if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
>> > + cfq_yield_cfqq(cfqd, cfqq);
>>
>> Jeff,
>>
>> I am wondering if cfq_select_queue() will be a better place for yielding
>> the queue.
>>
>> if (cfq_cfqq_yield(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
>> goto expire;
>>
>> We can avoid one unnecessary __blk_run_queue().
>
> Agree, doing it on insert is not the right place.

I see where you're coming from, but that makes things quite a bit
trickier. I look forward to the review of *that* patch. ;-)

Cheers,
Jeff