2010-07-02 19:58:38

by Jeff Moyer

Subject: [PATCH 0/6 v6][RFC] jbd[2]: enhance fsync performance when using CFQ

Hi,

Running iozone or fs_mark with fsync enabled, the performance of CFQ is
far worse than that of deadline for enterprise class storage when dealing
with file sizes of 8MB or less. I used the following command line as a
representative test case:

fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F

When run using the deadline I/O scheduler, the average of the first 5 results
is 529.44 files/second. CFQ yields only 106.7.

Because the iozone process is issuing synchronous writes, it is put
onto CFQ's SYNC service tree. The significance of this is that CFQ
will idle for up to 8ms waiting for requests on such queues. So,
what happens is that the iozone process will issue, say, 64KB worth
of write I/O. That I/O will just land in the page cache. Then, the
iozone process does an fsync which forces those I/Os to disk as
synchronous writes. Then, the file system's fsync method is invoked,
and for ext3/4, it calls log_start_commit followed by log_wait_commit.
Because those synchronous writes were forced out in the context of the
iozone process, CFQ will now idle on iozone's cfqq waiting for more I/O.
However, iozone's progress is now gated by the journal thread.
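
For reference, the I/O pattern described above boils down to the following
user-space sketch (an illustration only, not the actual benchmark; fs_mark
and iozone were used for the real numbers):

/* Write a small file through the page cache, then fsync it.  The fsync
 * forces the dirty pages out as synchronous writes and then blocks until
 * the journal thread commits the transaction. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int i, j, fd;

	memset(buf, 0xab, sizeof(buf));
	for (i = 0; i < 1000; i++) {
		fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* 64KB of buffered writes; these only dirty the page cache. */
		for (j = 0; j < 16; j++)
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				perror("write");
		/* Here the process blocks on the journal commit; with CFQ,
		 * the journal thread's I/O is what gates our progress. */
		if (fsync(fd) < 0)
			perror("fsync");
		close(fd);
	}
	return 0;
}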

With this patch series applied (in addition to the two other patches I
sent [1]), CFQ now achieves 530.82 files / second.

I also wanted to improve the performance of the fsync-ing process in the
presence of a competing sequential reader. The workload I used for that
was a fio job that did sequential buffered 4k reads while running the fs_mark
process. The run-time was 30 seconds, except where otherwise noted.
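
The competing reader can be approximated by a trivial sequential, buffered
4k reader along these lines (a stand-in sketch; the real runs used a fio job):

/* Stand-in for the fio job: sequential, buffered 4k reads of a large file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	ssize_t ret;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read sequentially until EOF; point it at a file large enough to
	 * keep the reader busy for the duration of the test. */
	while ((ret = read(fd, buf, sizeof(buf))) > 0)
		;
	close(fd);
	return 0;
}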

Deadline got 450 files/second while achieving a throughput of 78.2 MB/s for
the sequential reader. CFQ, unpatched, did not finish an fs_mark run
in 30 seconds. I had to bump the time of the test up to 5 minutes, and then
CFQ saw an fs_mark performance of 6.6 files/second and sequential reader
throughput of 137.2MB/s.

The fs_mark process was being starved as the WRITE_SYNC I/O is marked
with RQ_NOIDLE, and regular WRITES are part of the async workload by
default. So, a single request would be served from either the fs_mark
process or the journal thread, and then they would give up the I/O
scheduler.

After applying this patch set, CFQ can now perform 113.2 files/second while
achieving a throughput of 78.6 MB/s for the sequential reader. In table
form, the results (all averages of 5 runs) look like this:

               just fs_mark   just fio   |   mixed fs_mark   mixed fio
               (files/s)      (MB/s)     |   (files/s)       (MB/s)
              ---------------------------+-------------------------------
 deadline        529.44        151.4     |     450.0           78.2
 vanilla cfq     107.88        164.4     |       6.6          137.2
 patched cfq     530.82        158.7     |     113.2           78.6

While this is a huge jump for CFQ, it is still nowhere near competing with
deadline. I'm not sure what else I can do in this approach to address
that problem. I/O from the two streams really needs to be interleaved in
order to keep the storage busy.

Comments, as always, are appreciated. I think I may have explored this
alternative as far as is desirable, so if this is not a preferred method
of dealing with the problem, I'm all ears for new approaches.

Thanks!
Jeff

---

Changes from the last posting:
- Yielding no longer expires the current queue. Instead, it sets up new
requests from the target process so that they are issued in the yielding
process' cfqq. This means that we don't need to worry about losing group
or workload share.
- Journal commits are now synchronous I/Os, which was required to get any
sort of performance out of the fs_mark process in the presence of a
competing reader.
- WRITE_SYNC I/O no longer sets RQ_NOIDLE, for a similar reason.
- I did test OCFS2, and it does experience performance improvements, though
I forgot to record those.

Previous postings can be found here:
http://lkml.org/lkml/2010/4/1/344
http://lkml.org/lkml/2010/4/7/325
http://lkml.org/lkml/2010/4/14/394
http://lkml.org/lkml/2010/5/18/365
http://lkml.org/lkml/2010/6/22/338

[1] http://lkml.org/lkml/2010/6/21/307

[PATCH 1/6] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.
[PATCH 2/6] jbd: yield the device queue when waiting for commits
[PATCH 3/6] jbd2: yield the device queue when waiting for journal commits
[PATCH 4/6] jbd: use WRITE_SYNC for journal I/O
[PATCH 5/6] jbd2: use WRITE_SYNC for journal I/O
[PATCH 6/6] block: remove RQ_NOIDLE from WRITE_SYNC


2010-07-02 19:58:41

by Jeff Moyer

Subject: [PATCH 4/6] jbd: use WRITE_SYNC for journal I/O

In my fsync testing, journal I/O most definitely was sync I/O, since
another process was blocked waiting for the results. By marking all
journal I/O as WRITE_SYNC, I can get better performance with CFQ.

If there is a way to mark this only for cases where it is blocking progress
in a dependent process, then that would be preferable. Is there a means of
determining and flagging this?

Cheers,
Jeff

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd/commit.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 28a9dda..d97a0c6 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -317,7 +317,7 @@ void journal_commit_transaction(journal_t *journal)
int first_tag = 0;
int tag_flag;
int i;
- int write_op = WRITE;
+ int write_op = WRITE_SYNC;

/*
* First job: lock down the current transaction and wait for
--
1.6.5.2

2010-07-02 19:58:39

by Jeff Moyer

Subject: [PATCH 3/6] jbd2: yield the device queue when waiting for journal commits

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd2/journal.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index bc2ff59..6b5bf0f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -41,6 +41,7 @@
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/vmalloc.h>
+#include <linux/blkdev.h>

#define CREATE_TRACE_POINTS
#include <trace/events/jbd2.h>
@@ -192,6 +193,7 @@ loop:
should_sleep = 0;
if (should_sleep) {
spin_unlock(&journal->j_state_lock);
+ blk_yield(journal->j_dev->bd_disk->queue, NULL);
schedule();
spin_lock(&journal->j_state_lock);
}
@@ -580,6 +582,11 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.5.2

2010-07-02 19:58:36

by Jeff Moyer

Subject: [PATCH 2/6] jbd: yield the device queue when waiting for commits

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd/journal.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 93d1e47..7bf7528 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/blkdev.h>

#include <asm/uaccess.h>
#include <asm/page.h>
@@ -183,6 +184,7 @@ loop:
should_sleep = 0;
if (should_sleep) {
spin_unlock(&journal->j_state_lock);
+ blk_yield(journal->j_dev->bd_disk->queue, NULL);
schedule();
spin_lock(&journal->j_state_lock);
}
@@ -549,6 +551,11 @@ int log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.5.2

2010-07-02 19:59:31

by Jeff Moyer

Subject: [PATCH 5/6] jbd2: use WRITE_SYNC for journal I/O

In my fsync testing, journal I/O most definitely was sync I/O, since
another process was blocked waiting for the results. By marking all
journal I/O as WRITE_SYNC, I can get better performance with CFQ.

If there is a way to mark this only for cases where it is blocking progress
in a dependent process, then that would be preferable. Is there a means of
determining and flagging this?

Cheers,
Jeff

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd2/commit.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 75716d3..a078744 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -369,7 +369,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int tag_bytes = journal_tag_bytes(journal);
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
- int write_op = WRITE;
+ int write_op = WRITE_SYNC;

/*
* First job: lock down the current transaction and wait for
--
1.6.5.2

2010-07-02 19:59:29

by Jeff Moyer

Subject: [PATCH 6/6] block: remove RQ_NOIDLE from WRITE_SYNC

In trying to get fsync-ing processes to perform as well under CFQ as they
do under deadline, I found (with the current blk_yield approach) that it was
necessary to get rid of the RQ_NOIDLE flag for WRITE_SYNC I/O. Instead,
we do explicit yielding of the I/O scheduler.

Comments, as always, are greatly appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <[email protected]>
---
include/linux/fs.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..da34297 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -152,12 +152,12 @@ struct inodes_stat_t {
#define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */
#define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define READ_META (READ | (1 << BIO_RW_META))
-#define WRITE_SYNC_PLUG (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
+#define WRITE_SYNC_PLUG (WRITE | (1 << BIO_RW_SYNCIO))
#define WRITE_SYNC (WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
#define WRITE_ODIRECT_PLUG (WRITE | (1 << BIO_RW_SYNCIO))
#define WRITE_META (WRITE | (1 << BIO_RW_META))
#define SWRITE_SYNC_PLUG \
- (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
+ (SWRITE | (1 << BIO_RW_SYNCIO))
#define SWRITE_SYNC (SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))

--
1.6.5.2

2010-07-02 20:00:10

by Jeff Moyer

Subject: [PATCH 1/6] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.

This patch implements a blk_yield function to allow a process to voluntarily
give up its I/O scheduler time slice. This is desirable for those processes
which know that they will be blocked on I/O from another process, such as
the file system journal thread. The yield call works by causing the target
process to issue I/O in the context of the cfqq of the calling process.
Following patches will put calls to blk_yield into jbd and jbd2.
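
For illustration, the call site added by the jbd patches later in this
series looks roughly like the fragment below (a sketch of the
log_wait_commit() path; the wait_event() condition shown is the existing
jbd one, not new code in this patch):

	/*
	 * Before sleeping on the journal thread, hand it our I/O
	 * scheduler slice so that its commit I/O is issued in our
	 * (sync) cfqq instead of waiting out CFQ's idle window.
	 */
	blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
	wake_up(&journal->j_wait_commit);
	spin_unlock(&journal->j_state_lock);
	wait_event(journal->j_wait_done_commit,
		   !tid_gt(tid, journal->j_commit_sequence));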

Signed-off-by: Jeff Moyer <[email protected]>
---
block/blk-core.c | 24 +++++++
block/blk-ioc.c | 1 +
block/blk-settings.c | 6 ++
block/cfq-iosched.c | 147 +++++++++++++++++++++++++++++++++++++++++++-
block/elevator.c | 15 +++++
include/linux/blkdev.h | 4 +
include/linux/elevator.h | 3 +
include/linux/iocontext.h | 3 +
8 files changed, 199 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f84cce4..e9530eb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -324,6 +324,29 @@ void blk_unplug(struct request_queue *q)
}
EXPORT_SYMBOL(blk_unplug);

+static void generic_yield_iosched(struct request_queue *q,
+ struct task_struct *tsk)
+{
+ elv_yield(q, tsk);
+}
+
+/**
+ * blk_yield()
+ * @q: request_queue to which we're doing I/O
+ * @tsk: task to which we're yielding the I/O scheduler
+ *
+ * This function should be called by code which knows that it is waiting
+ * on another thread to perform I/O in order for it to make progress. By
+ * yielding the I/O scheduler, a potentially significant idling window can
+ * be bypassed, resulting in better latency and throughput.
+ */
+void blk_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->yield_fn)
+ q->yield_fn(q, tsk);
+}
+EXPORT_SYMBOL(blk_yield);
+
/**
* blk_start_queue - restart a previously stopped queue
* @q: The &struct request_queue in question
@@ -609,6 +632,7 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
q->request_fn = rfn;
q->prep_rq_fn = NULL;
q->unplug_fn = generic_unplug_device;
+ q->yield_fn = generic_yield_iosched;
q->queue_flags = QUEUE_FLAG_DEFAULT;
q->queue_lock = lock;

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index d22c4c5..3a7b507 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -97,6 +97,7 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ret->cic_list);
ret->ioc_data = NULL;
+ ret->on_behalf_of = NULL;
}

return ret;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f5ed5a1..1353767 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -171,6 +171,12 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
}
EXPORT_SYMBOL(blk_queue_make_request);

+void blk_queue_yield_set(struct request_queue *q, yield_fn *yield)
+{
+ q->yield_fn = yield;
+}
+EXPORT_SYMBOL_GPL(blk_queue_yield_set);
+
/**
* blk_queue_bounce_limit - set bounce buffer limit for queue
* @q: the request queue for the device
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dab836e..00b14d4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -87,9 +87,19 @@ struct cfq_rb_root {
unsigned total_weight;
u64 min_vdisktime;
struct rb_node *active;
+ /*
+ * The following two fields are used only for the SYNC_NOIDLE
+ * service tree. Taken together, they are used to determine
+ * whether or not there is currently a dependent reader doing
+ * I/O on this service tree. last_expiry records the last time
+ * that a queue was expired in this service tree, and last_pid
+ * tells which cfqq->pid it was that was expired.
+ */
+ unsigned long last_expiry;
+ pid_t last_pid;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
- .count = 0, .min_vdisktime = 0, }
+ .count = 0, .min_vdisktime = 0, .last_pid = (pid_t)-1, }

/*
* Per process-grouping structure
@@ -147,6 +157,7 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ struct io_context *yield_to, *yield_from;
};

/*
@@ -318,6 +329,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
+ CFQ_CFQQ_FLAG_yield, /* Allow another cfqq to run */
};

#define CFQ_CFQQ_FNS(name) \
@@ -347,6 +359,7 @@ CFQ_CFQQ_FNS(coop);
CFQ_CFQQ_FNS(split_coop);
CFQ_CFQQ_FNS(deep);
CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(yield);
#undef CFQ_CFQQ_FNS

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1594,6 +1607,9 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfq_mark_cfqq_slice_new(cfqq);

cfq_del_timer(cfqd, cfqq);
+
+ if (cfqq->yield_to)
+ cfqq->yield_to->on_behalf_of = cfqq->yield_from;
}

cfqd->active_queue = cfqq;
@@ -1614,6 +1630,18 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_wait_busy(cfqq);

+ if (cfq_cfqq_yield(cfqq)) {
+ if (cfqq->yield_to) {
+ cfqq->yield_to->on_behalf_of = NULL;
+ put_io_context(cfqq->yield_to);
+ cfqq->yield_to = cfqq->yield_from = NULL;
+ }
+ cfq_clear_cfqq_yield(cfqq);
+ } else {
+ cfqq->service_tree->last_expiry = jiffies;
+ cfqq->service_tree->last_pid = cfqq->pid;
+ }
+
/*
* If this cfqq is shared between multiple processes, check to
* make sure that those processes are still issuing I/Os within
@@ -2118,7 +2146,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
slice = max(slice, 2 * cfqd->cfq_slice_idle);

slice = max_t(unsigned, slice, CFQ_MIN_TT);
- cfq_log(cfqd, "workload slice:%d", slice);
+ cfq_log(cfqd, "workload:%d slice:%d", cfqd->serving_type, slice);
cfqd->workload_expires = jiffies + slice;
cfqd->noidle_tree_requires_idle = false;
}
@@ -2241,6 +2269,96 @@ keep_queue:
return cfqq;
}

+static int expiry_data_valid(struct cfq_rb_root *service_tree)
+{
+ return (service_tree->last_pid != (pid_t)-1 &&
+ service_tree->last_expiry != 0UL);
+}
+
+static bool cfq_should_yield_now(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+
+ if (cfqq != cfqd->active_queue)
+ return false;
+
+ if (cfqd->serving_type != SYNC_NOIDLE_WORKLOAD)
+ return true;
+
+ /*
+ * This is the sync-noidle workload. If there is a dependent reader
+ * executing now, then we should not allow yielding.
+ */
+ if (expiry_data_valid(cfqq->service_tree) &&
+ time_before(cfqq->service_tree->last_expiry +
+ cfq_slice_idle, jiffies) &&
+ cfqq->service_tree->last_pid != cfqq->pid)
+ return false;
+
+ return true;
+}
+
+/*
+ * Explicitly give up this (sync) cfqq's time slice to the specified
+ * task. This is currently used by the journal code when it is waiting
+ * on the jbd[2] thread to issue I/O on its behalf.
+ */
+static void cfq_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_io_context *cic;
+ struct cfq_queue *cfqq;
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq) {
+ spin_unlock_irq(q->queue_lock);
+ return;
+ }
+
+ if (tsk) {
+ task_lock(tsk);
+ /*
+ * If the task hasn't yet performed any I/O, then it
+ * will have no io_context. We can't create one for
+ * another task, so just don't yield the queue in this
+ * corner case.
+ */
+ if (!tsk->io_context) {
+ task_unlock(tsk);
+ goto out_unlock;
+ }
+ atomic_long_inc(&tsk->io_context->refcount);
+ cfqq->yield_to = tsk->io_context;
+ cfqq->yield_from = current->io_context;
+ task_unlock(tsk);
+ } else {
+ if (cfq_should_yield_now(cfqd, cfqq)) {
+ __cfq_slice_expired(cfqd, cfqq, 0);
+ cfq_schedule_dispatch(cfqd);
+ } else
+ cfq_mark_cfqq_yield(cfqq);
+ goto out_unlock;
+ }
+
+ cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
+ cfq_mark_cfqq_yield(cfqq);
+ if (cfqd->active_queue == cfqq)
+ tsk->io_context->on_behalf_of = current->io_context;
+
+ spin_unlock_irq(q->queue_lock);
+ return;
+
+out_unlock:
+ spin_unlock_irq(q->queue_lock);
+ if (tsk)
+ put_io_context(tsk->io_context);
+}
+
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
{
int dispatched = 0;
@@ -3010,6 +3128,13 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
if (!ioc)
return NULL;

+ if (ioc->on_behalf_of) {
+ struct io_context *old_ioc = ioc;
+ ioc = ioc->on_behalf_of;
+ put_io_context(old_ioc);
+ atomic_long_inc(&ioc->refcount);
+ }
+
cic = cfq_cic_lookup(cfqd, ioc);
if (cic)
goto out;
@@ -3319,6 +3444,9 @@ static bool cfq_should_wait_busy(struct cfq_data *cfqd, struct cfq_queue *cfqq)
if (cfqq->cfqg->nr_cfqq > 1)
return false;

+ if (cfq_cfqq_yield(cfqq))
+ return false;
+
if (cfq_slice_used(cfqq))
return true;

@@ -3401,7 +3529,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
cfq_slice_expired(cfqd, 1);
else if (sync && cfqq_empty &&
- !cfq_close_cooperator(cfqd, cfqq)) {
+ !cfq_close_cooperator(cfqd, cfqq) &&
+ (!cfq_cfqq_yield(cfqq) ||
+ (cfq_cfqq_yield(cfqq) && cfqq->yield_to))) {
+
cfqd->noidle_tree_requires_idle |= !rq_noidle(rq);
/*
* Idling is enabled for SYNC_WORKLOAD.
@@ -3548,7 +3679,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_io_context *cic;
const int rw = rq_data_dir(rq);
- const bool is_sync = rq_is_sync(rq);
+ bool is_sync = rq_is_sync(rq);
struct cfq_queue *cfqq;
unsigned long flags;

@@ -3561,6 +3692,13 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
if (!cic)
goto queue_fail;

+ /*
+ * If another process called blk_yield specifying us as the target,
+ * then we issue I/O via their sync cfqq.
+ */
+ if (current->io_context->on_behalf_of)
+ is_sync = 1;
+
new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
@@ -3973,6 +4111,7 @@ static struct elevator_type iosched_cfq = {
.elevator_deactivate_req_fn = cfq_deactivate_request,
.elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
+ .elevator_yield_fn = cfq_yield,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_set_req_fn = cfq_set_request,
diff --git a/block/elevator.c b/block/elevator.c
index 923a913..aa3c326 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -866,6 +866,21 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
}
}

+/**
+ * elv_yield() - explicitly give up the I/O scheduler
+ * @q: request_queue for the device to which we're doing I/O
+ * @tsk: task_struct of the process to which we're yielding
+ *
+ * This function abstracts out the I/O scheduler's yield function.
+ */
+void elv_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->ops->elevator_yield_fn)
+ e->ops->elevator_yield_fn(q, tsk);
+}
+
#define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)

static ssize_t
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 09a8402..ef2d10c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -263,6 +263,7 @@ struct request_pm_state

typedef void (request_fn_proc) (struct request_queue *q);
typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
+typedef void (yield_fn) (struct request_queue *q, struct task_struct *tsk);
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unplug_fn) (struct request_queue *);

@@ -345,6 +346,7 @@ struct request_queue

request_fn_proc *request_fn;
make_request_fn *make_request_fn;
+ yield_fn *yield_fn;
prep_rq_fn *prep_rq_fn;
unplug_fn *unplug_fn;
merge_bvec_fn *merge_bvec_fn;
@@ -837,6 +839,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
struct request *, int, rq_end_io_fn *);
extern void blk_unplug(struct request_queue *q);
+extern void blk_yield(struct request_queue *q, struct task_struct *tsk);

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
@@ -929,6 +932,7 @@ extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
request_fn_proc *, spinlock_t *);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
+extern void blk_queue_yield_set(struct request_queue *, yield_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c958f4..a68b5b1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,6 +23,7 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_queue_empty_fn) (struct request_queue *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
+typedef void (elevator_yield_fn) (struct request_queue *, struct task_struct *tsk);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
@@ -48,6 +49,7 @@ struct elevator_ops

elevator_queue_empty_fn *elevator_queue_empty_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
+ elevator_yield_fn *elevator_yield_fn;

elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
@@ -111,6 +113,7 @@ extern void elv_bio_merged(struct request_queue *q, struct request *,
struct bio *);
extern void elv_requeue_request(struct request_queue *, struct request *);
extern int elv_queue_empty(struct request_queue *);
+extern void elv_yield(struct request_queue *, struct task_struct *);
extern struct request *elv_former_request(struct request_queue *, struct request *);
extern struct request *elv_latter_request(struct request_queue *, struct request *);
extern int elv_register_queue(struct request_queue *q);
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 64d5291..1e3e578 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -54,6 +54,9 @@ struct io_context {
struct radix_tree_root radix_root;
struct hlist_head cic_list;
void *ioc_data;
+ /* set when another process has yielded its I/O scheduler slice to
+ * this process */
+ struct io_context *on_behalf_of;
};

static inline struct io_context *ioc_task_link(struct io_context *ioc)
--
1.6.5.2

2010-07-02 20:23:50

by Vivek Goyal

Subject: Re: [PATCH 0/6 v6][RFC] jbd[2]: enhance fsync performance when using CFQ

On Fri, Jul 02, 2010 at 03:58:13PM -0400, Jeff Moyer wrote:

[..]
> Changes from the last posting:
> - Yielding no longer expires the current queue. Instead, it sets up new
> requests from the target process so that they are issued in the yielding
> process' cfqq. This means that we don't need to worry about losing group
> or workload share.
> - Journal commits are now synchronous I/Os, which was required to get any
> sort of performance out of the fs_mark process in the presence of a
> competing reader.
> - WRITE_SYNC I/O no longer sets RQ_NOIDLE, for a similar reason.

Hi Jeff,

So this patchset relies on idling on WRITE_SYNC queues. In general, though,
we don't have examples of why one should idle on processes doing WRITE_SYNC
IO, because previous IO does not tell us anything about the upcoming IO. I am
bringing up this point again to make sure that we fundamentally agree that
continuing to idle on WRITE_SYNC is the right thing to do; otherwise this patch
will fall apart.

I have yet to go through the patch in detail, but allowing another queue to
dispatch requests into the same queue sounds like queue merging. So can
we use those semantics and call it elv_merge_context() or elv_merge_queue()
instead of elv_yield()? In the code we could just merge the two queues when
the next request comes in and separate them out at slice expiry, I
guess.

Thanks
Vivek

> - I did test OCFS2, and it does experience performance improvements, though
> I forgot to record those.
>
> Previous postings can be found here:
> http://lkml.org/lkml/2010/4/1/344
> http://lkml.org/lkml/2010/4/7/325
> http://lkml.org/lkml/2010/4/14/394
> http://lkml.org/lkml/2010/5/18/365
> http://lkml.org/lkml/2010/6/22/338
>
> [1] http://lkml.org/lkml/2010/6/21/307
>
> [PATCH 1/6] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.
> [PATCH 2/6] jbd: yield the device queue when waiting for commits
> [PATCH 3/6] jbd2: yield the device queue when waiting for journal commits
> [PATCH 4/6] jbd: use WRITE_SYNC for journal I/O
> [PATCH 5/6] jbd2: use WRITE_SYNC for journal I/O
> [PATCH 6/6] block: remove RQ_NOIDLE from WRITE_SYNC

2010-07-02 20:32:41

by Jeff Moyer

Subject: Re: [PATCH 0/6 v6][RFC] jbd[2]: enhance fsync performance when using CFQ

Vivek Goyal <[email protected]> writes:

> On Fri, Jul 02, 2010 at 03:58:13PM -0400, Jeff Moyer wrote:
>
> [..]
>> Changes from the last posting:
>> - Yielding no longer expires the current queue. Instead, it sets up new
>> requests from the target process so that they are issued in the yielding
>> process' cfqq. This means that we don't need to worry about losing group
>> or workload share.
>> - Journal commits are now synchronous I/Os, which was required to get any
>> sort of performance out of the fs_mark process in the presence of a
>> competing reader.
>> - WRITE_SYNC I/O no longer sets RQ_NOIDLE, for a similar reason.
>
> Hi Jeff,
>
> So this patchset relies on idling on WRITE_SYNC queues. In general, though,
> we don't have examples of why one should idle on processes doing WRITE_SYNC
> IO, because previous IO does not tell us anything about the upcoming IO. I am
> bringing up this point again to make sure that we fundamentally agree that
> continuing to idle on WRITE_SYNC is the right thing to do; otherwise this patch
> will fall apart.

I think a mail server would be an example of an application that might
do this. I'll see if I can get a real world test case (or perhaps some
real world data) and verify that.

I agree that if we choose not to idle on writes, then this approach can
be thrown out the window.

> I have yet to go through the patch in detail, but allowing another queue to
> dispatch requests into the same queue sounds like queue merging. So can
> we use those semantics and call it elv_merge_context() or elv_merge_queue()
> instead of elv_yield()? In the code we could just merge the two queues when
> the next request comes in and separate them out at slice expiry, I
> guess.

I considered that approach, but then you run into all of the questions
about losing fairness across workloads and across groups. I believe the
approach I've taken here is *significantly* simpler than merging and
unmerging would be.

Cheers,
Jeff

2010-07-06 06:29:36

by Tao Ma

[permalink] [raw]
Subject: Re: [PATCH 0/6 v6][RFC] jbd[2]: enhance fsync performance when using CFQ

Hi Jeff,

On 07/03/2010 03:58 AM, Jeff Moyer wrote:
> Hi,
>
> Running iozone or fs_mark with fsync enabled, the performance of CFQ is
> far worse than that of deadline for enterprise class storage when dealing
> with file sizes of 8MB or less. I used the following command line as a
> representative test case:
>
> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
>
I ran the script with "35-rc4 + this patch version" on an ocfs2 volume,
and I get no hang now. Thanks for the work. I also have some numbers for
you; see below.
>
> Because the iozone process is issuing synchronous writes, it is put
> onto CFQ's SYNC service tree. The significance of this is that CFQ
> will idle for up to 8ms waiting for requests on such queues. So,
> what happens is that the iozone process will issue, say, 64KB worth
> of write I/O. That I/O will just land in the page cache. Then, the
> iozone process does an fsync which forces those I/Os to disk as
> synchronous writes. Then, the file system's fsync method is invoked,
> and for ext3/4, it calls log_start_commit followed by log_wait_commit.
> Because those synchronous writes were forced out in the context of the
> iozone process, CFQ will now idle on iozone's cfqq waiting for more I/O.
> However, iozone's progress is now gated by the journal thread.
>
> With this patch series applied (in addition to the two other patches I
> sent [1]), CFQ now achieves 530.82 files / second.
>
> I also wanted to improve the performance of the fsync-ing process in the
> presence of a competing sequential reader. The workload I used for that
> was a fio job that did sequential buffered 4k reads while running the fs_mark
> process. The run-time was 30 seconds, except where otherwise noted.
>
> Deadline got 450 files/second while achieving a throughput of 78.2 MB/s for
> the sequential reader. CFQ, unpatched, did not finish an fs_mark run
> in 30 seconds. I had to bump the time of the test up to 5 minutes, and then
> CFQ saw an fs_mark performance of 6.6 files/second and sequential reader
> throughput of 137.2MB/s.
>
> The fs_mark process was being starved as the WRITE_SYNC I/O is marked
> with RQ_NOIDLE, and regular WRITES are part of the async workload by
> default. So, a single request would be served from either the fs_mark
> process or the journal thread, and then they would give up the I/O
> scheduler.
>
> After applying this patch set, CFQ can now perform 113.2 files/second while
> achieving a throughput of 78.6 MB/s for the sequential reader. In table
> form, the results (all averages of 5 runs) look like this:
>
>                just fs_mark   just fio   |   mixed fs_mark   mixed fio
>                (files/s)      (MB/s)     |   (files/s)       (MB/s)
>               ---------------------------+-------------------------------
>  deadline        529.44        151.4     |     450.0           78.2
>  vanilla cfq     107.88        164.4     |       6.6          137.2
>  patched cfq     530.82        158.7     |     113.2           78.6
Just some numbers from the ocfs2 test:

                fs_mark (files/s)
             --------------------
 deadline           386.3
 vanilla cfq         59.7
 patched cfq        366.2

So there is really a fantastic improvement, at least from what fs_mark
gives us. Many thanks.

Regards,
Tao