Hi,
Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
far worse than that of deadline for enterprise class storage when dealing
with file sizes of 8MB or less. I used the following command line as a
representative test case:
fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
When run using the deadline I/O scheduler, an average of the first 5 numbers
will give you 448.4 files / second. CFQ will yield only 106.7. With
this patch series applied (and the two patches I sent yesterday), CFQ now
achieves 462.5 files / second.
This patch set is still an RFC. I'd like to make it perform better when
there is a competing sequential reader present. For now, I've addressed
the concerns voiced about the previous posting.
Review and testing would be greatly appreciated.
Thanks!
Jeff
---
New from the last round:
- removed the think time calculation I added for the sync-noidle service tree
- replaced above with a suggestion from Vivek to only guard against currently
active sequential readers when determining if we can preempt the sync-noidle
service tree.
- bug fixes
Overall, I think it's simpler now thanks to the suggestions from Jens and
Vivek.
[PATCH 1/3] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.
[PATCH 2/3] jbd: yield the device queue when waiting for commits
[PATCH 3/3] jbd2: yield the device queue when waiting for journal commits
This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.
Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd/journal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 93d1e47..9b6cf4c 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/blkdev.h>
#include <asm/uaccess.h>
#include <asm/page.h>
@@ -549,6 +550,11 @@ int log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.5.2
This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.
Signed-off-by: Jeff Moyer <[email protected]>
---
fs/jbd2/journal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index bc2ff59..aba4754 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -41,6 +41,7 @@
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/vmalloc.h>
+#include <linux/blkdev.h>
#define CREATE_TRACE_POINTS
#include <trace/events/jbd2.h>
@@ -580,6 +581,11 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
+ /*
+ * Give up our I/O scheduler time slice to allow the journal
+ * thread to issue I/O.
+ */
+ blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
wake_up(&journal->j_wait_commit);
spin_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
--
1.6.5.2
This patch implements a blk_yield to allow a process to voluntarily
give up its I/O scheduler time slice. This is desirable for those processes
which know that they will be blocked on I/O from another process, such as
the file system journal thread. Following patches will put calls to blk_yield
into jbd and jbd2.
Signed-off-by: Jeff Moyer <[email protected]>
---
block/blk-core.c | 13 +++++
block/blk-settings.c | 6 ++
block/cfq-iosched.c | 123 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator.c | 8 +++
include/linux/blkdev.h | 4 ++
include/linux/elevator.h | 3 +
6 files changed, 155 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index f84cce4..b9afbba 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -324,6 +324,18 @@ void blk_unplug(struct request_queue *q)
}
EXPORT_SYMBOL(blk_unplug);
+void generic_yield_iosched(struct request_queue *q, struct task_struct *tsk)
+{
+ elv_yield(q, tsk);
+}
+
+void blk_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->yield_fn)
+ q->yield_fn(q, tsk);
+}
+EXPORT_SYMBOL(blk_yield);
+
/**
* blk_start_queue - restart a previously stopped queue
* @q: The &struct request_queue in question
@@ -609,6 +621,7 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
q->request_fn = rfn;
q->prep_rq_fn = NULL;
q->unplug_fn = generic_unplug_device;
+ q->yield_fn = generic_yield_iosched;
q->queue_flags = QUEUE_FLAG_DEFAULT;
q->queue_lock = lock;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f5ed5a1..fe548c9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -171,6 +171,12 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
}
EXPORT_SYMBOL(blk_queue_make_request);
+void blk_queue_yield(struct request_queue *q, yield_fn *yield)
+{
+ q->yield_fn = yield;
+}
+EXPORT_SYMBOL_GPL(blk_queue_yield);
+
/**
* blk_queue_bounce_limit - set bounce buffer limit for queue
* @q: the request queue for the device
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dab836e..a9922b9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -87,9 +87,12 @@ struct cfq_rb_root {
unsigned total_weight;
u64 min_vdisktime;
struct rb_node *active;
+ unsigned long last_expiry;
+ pid_t last_pid;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
- .count = 0, .min_vdisktime = 0, }
+ .count = 0, .min_vdisktime = 0, .last_expiry = 0UL, \
+ .last_pid = (pid_t)-1, }
/*
* Per process-grouping structure
@@ -147,6 +150,7 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ struct cfq_io_context *yield_to;
};
/*
@@ -318,6 +322,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
+ CFQ_CFQQ_FLAG_yield, /* Allow another cfqq to run */
};
#define CFQ_CFQQ_FNS(name) \
@@ -347,6 +352,7 @@ CFQ_CFQQ_FNS(coop);
CFQ_CFQQ_FNS(split_coop);
CFQ_CFQQ_FNS(deep);
CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(yield);
#undef CFQ_CFQQ_FNS
#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1614,6 +1620,15 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_wait_busy(cfqq);
+ if (!cfq_cfqq_yield(cfqq)) {
+ struct cfq_rb_root *st;
+ st = service_tree_for(cfqq->cfqg,
+ cfqq_prio(cfqq), cfqq_type(cfqq));
+ st->last_expiry = jiffies;
+ st->last_pid = cfqq->pid;
+ }
+ cfq_clear_cfqq_yield(cfqq);
+
/*
* If this cfqq is shared between multiple processes, check to
* make sure that those processes are still issuing I/Os within
@@ -2118,7 +2133,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
slice = max(slice, 2 * cfqd->cfq_slice_idle);
slice = max_t(unsigned, slice, CFQ_MIN_TT);
- cfq_log(cfqd, "workload slice:%d", slice);
+ cfq_log(cfqd, "workload:%d slice:%d", cfqd->serving_type, slice);
cfqd->workload_expires = jiffies + slice;
cfqd->noidle_tree_requires_idle = false;
}
@@ -2153,6 +2168,36 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
choose_service_tree(cfqd, cfqg);
}
+static int cfq_should_yield_now(struct cfq_queue *cfqq,
+ struct cfq_queue **yield_to)
+{
+ struct cfq_queue *new_cfqq;
+
+ new_cfqq = cic_to_cfqq(cfqq->yield_to, 1);
+
+ /*
+ * If the queue we're yielding to is in a different cgroup,
+ * just expire our own time slice.
+ */
+ if (new_cfqq->cfqg != cfqq->cfqg) {
+ *yield_to = NULL;
+ return 1;
+ }
+
+ /*
+ * If the new queue has pending I/O, then switch to it
+ * immediately. Otherwise, see if we can idle until it is
+ * ready to preempt us.
+ */
+ if (!RB_EMPTY_ROOT(&new_cfqq->sort_list)) {
+ *yield_to = new_cfqq;
+ return 1;
+ }
+
+ *yield_to = NULL;
+ return 0;
+}
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -2187,6 +2232,10 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
* have been idling all along on this queue and it should be
* ok to wait for this request to complete.
*/
+ if (cfq_cfqq_yield(cfqq) &&
+ cfq_should_yield_now(cfqq, &new_cfqq))
+ goto expire;
+
if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
&& cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
cfqq = NULL;
@@ -2215,6 +2264,9 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
goto expire;
}
+ if (cfq_cfqq_yield(cfqq) && cfq_should_yield_now(cfqq, &new_cfqq))
+ goto expire;
+
/*
* No requests pending. If the active queue still has requests in
* flight or is idling for a new request, allow either of these
@@ -2241,6 +2293,65 @@ keep_queue:
return cfqq;
}
+static inline int expiry_data_valid(struct cfq_rb_root *service_tree)
+{
+ return (service_tree->last_pid != (pid_t)-1 &&
+ service_tree->last_expiry != 0UL);
+}
+
+static void cfq_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_io_context *cic, *new_cic;
+ struct cfq_queue *cfqq;
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ task_lock(tsk);
+ new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
+ atomic_long_inc(&tsk->io_context->refcount);
+ task_unlock(tsk);
+ if (!new_cic)
+ goto out_dec;
+
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq)
+ goto out_unlock;
+
+ /*
+ * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
+ * are idling on the last queue in that workload, *and* there are no
+ * potential dependent readers running currently, then go ahead and
+ * yield the queue.
+ */
+ if (cfqd->active_queue == cfqq &&
+ cfqd->serving_type == SYNC_NOIDLE_WORKLOAD) {
+ /*
+ * If there's been no I/O from another process in the idle
+ * slice time, then there is by definition no dependent
+ * read going on for this service tree.
+ */
+ if (expiry_data_valid(cfqq->service_tree) &&
+ time_before(cfqq->service_tree->last_expiry +
+ cfq_slice_idle, jiffies) &&
+ cfqq->service_tree->last_pid != cfqq->pid)
+ goto out_unlock;
+ }
+
+ cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
+ cfqq->yield_to = new_cic;
+ cfq_mark_cfqq_yield(cfqq);
+
+out_unlock:
+ spin_unlock_irq(q->queue_lock);
+out_dec:
+ put_io_context(tsk->io_context);
+}
+
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
{
int dispatched = 0;
@@ -3123,6 +3234,13 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (!cfqq)
return false;
+ /*
+ * If the active queue yielded its timeslice to this queue, let
+ * it preempt.
+ */
+ if (cfq_cfqq_yield(cfqq) && RQ_CIC(rq) == cfqq->yield_to)
+ return true;
+
if (cfq_class_idle(new_cfqq))
return false;
@@ -3973,6 +4091,7 @@ static struct elevator_type iosched_cfq = {
.elevator_deactivate_req_fn = cfq_deactivate_request,
.elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
+ .elevator_yield_fn = cfq_yield,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_set_req_fn = cfq_set_request,
diff --git a/block/elevator.c b/block/elevator.c
index 923a913..5e33297 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -866,6 +866,14 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
}
}
+void elv_yield(struct request_queue *q, struct task_struct *tsk)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->ops->elevator_yield_fn)
+ e->ops->elevator_yield_fn(q, tsk);
+}
+
#define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
static ssize_t
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 09a8402..8d073c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -263,6 +263,7 @@ struct request_pm_state
typedef void (request_fn_proc) (struct request_queue *q);
typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
+typedef void (yield_fn) (struct request_queue *q, struct task_struct *tsk);
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unplug_fn) (struct request_queue *);
@@ -345,6 +346,7 @@ struct request_queue
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
+ yield_fn *yield_fn;
prep_rq_fn *prep_rq_fn;
unplug_fn *unplug_fn;
merge_bvec_fn *merge_bvec_fn;
@@ -837,6 +839,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
struct request *, int, rq_end_io_fn *);
extern void blk_unplug(struct request_queue *q);
+extern void blk_yield(struct request_queue *q, struct task_struct *tsk);
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
@@ -929,6 +932,7 @@ extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
request_fn_proc *, spinlock_t *);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
+extern void blk_queue_yield(struct request_queue *, yield_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c958f4..a68b5b1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,6 +23,7 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_queue_empty_fn) (struct request_queue *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
+typedef void (elevator_yield_fn) (struct request_queue *, struct task_struct *tsk);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);
typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
@@ -48,6 +49,7 @@ struct elevator_ops
elevator_queue_empty_fn *elevator_queue_empty_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
+ elevator_yield_fn *elevator_yield_fn;
elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
@@ -111,6 +113,7 @@ extern void elv_bio_merged(struct request_queue *q, struct request *,
struct bio *);
extern void elv_requeue_request(struct request_queue *, struct request *);
extern int elv_queue_empty(struct request_queue *);
+extern void elv_yield(struct request_queue *, struct task_struct *);
extern struct request *elv_former_request(struct request_queue *, struct request *);
extern struct request *elv_latter_request(struct request_queue *, struct request *);
extern int elv_register_queue(struct request_queue *q);
--
1.6.5.2
On Tue, Jun 22, 2010 at 05:34:59PM -0400, Jeff Moyer wrote:
> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
> far worse than that of deadline for enterprise class storage when dealing
> with file sizes of 8MB or less. I used the following command line as a
> representative test case:
>
> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
I'd be interested in how ocfs2 does, because we use jbd2 too.
Joel
--
Life's Little Instruction Book #139
"Never deprive someone of hope; it might be all they have."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Tue, 22 Jun 2010 17:35:00 -0400 Jeff Moyer <[email protected]> wrote:
> This patch implements a blk_yield to allow a process to voluntarily
> give up its I/O scheduler time slice. This is desirable for those processes
> which know that they will be blocked on I/O from another process, such as
> the file system journal thread. Following patches will put calls to blk_yield
> into jbd and jbd2.
>
I'm looking through this patch series trying to find the
analysis/description of the cause for this (bad) performance problem.
But all I'm seeing is implementation stuff :( It's hard to review code
with your eyes shut.
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -324,6 +324,18 @@ void blk_unplug(struct request_queue *q)
> }
> EXPORT_SYMBOL(blk_unplug);
>
> +void generic_yield_iosched(struct request_queue *q, struct task_struct *tsk)
> +{
> + elv_yield(q, tsk);
> +}
static?
>
> ...
>
> +void blk_queue_yield(struct request_queue *q, yield_fn *yield)
> +{
> + q->yield_fn = yield;
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_yield);
There's a tradition in the block layer of using truly awful identifiers
for functions-which-set-things. But there's no good reason for
retaining that tradition. blk_queue_yield_set(), perhaps.
(what name would you give an accessor which _reads_ q->yield_fn? yup,
"blk_queue_yield()". doh).
> /**
> * blk_queue_bounce_limit - set bounce buffer limit for queue
> * @q: the request queue for the device
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index dab836e..a9922b9 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -87,9 +87,12 @@ struct cfq_rb_root {
> unsigned total_weight;
> u64 min_vdisktime;
> struct rb_node *active;
> + unsigned long last_expiry;
> + pid_t last_pid;
These fields are pretty fundamental to understanding the
implementation. Some nice descriptions would be nice.
> };
> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
> - .count = 0, .min_vdisktime = 0, }
> + .count = 0, .min_vdisktime = 0, .last_expiry = 0UL, \
> + .last_pid = (pid_t)-1, }
May as well leave the 0 and NULL fields unmentioned (ie: don't do
crappy stuff because the old code did crappy stuff!)
> /*
> * Per process-grouping structure
> @@ -147,6 +150,7 @@ struct cfq_queue {
> struct cfq_queue *new_cfqq;
> struct cfq_group *cfqg;
> struct cfq_group *orig_cfqg;
> + struct cfq_io_context *yield_to;
> };
>
> /*
>
> ...
>
> +static int cfq_should_yield_now(struct cfq_queue *cfqq,
> + struct cfq_queue **yield_to)
The bool-returning function could return a bool type.
> +{
> + struct cfq_queue *new_cfqq;
> +
> + new_cfqq = cic_to_cfqq(cfqq->yield_to, 1);
> +
> + /*
> + * If the queue we're yielding to is in a different cgroup,
> + * just expire our own time slice.
> + */
> + if (new_cfqq->cfqg != cfqq->cfqg) {
> + *yield_to = NULL;
> + return 1;
> + }
> +
> + /*
> + * If the new queue has pending I/O, then switch to it
> + * immediately. Otherwise, see if we can idle until it is
> + * ready to preempt us.
> + */
> + if (!RB_EMPTY_ROOT(&new_cfqq->sort_list)) {
> + *yield_to = new_cfqq;
> + return 1;
> + }
> +
> + *yield_to = NULL;
> + return 0;
> +}
> +
> /*
> * Select a queue for service. If we have a current active queue,
> * check whether to continue servicing it, or retrieve and set a new one.
>
> ...
>
> +static inline int expiry_data_valid(struct cfq_rb_root *service_tree)
> +{
> + return (service_tree->last_pid != (pid_t)-1 &&
> + service_tree->last_expiry != 0UL);
> +}
The compiler will inline this.
> +static void cfq_yield(struct request_queue *q, struct task_struct *tsk)
-ENODESCRIPTION
> +{
> + struct cfq_data *cfqd = q->elevator->elevator_data;
> + struct cfq_io_context *cic, *new_cic;
> + struct cfq_queue *cfqq;
> +
> + cic = cfq_cic_lookup(cfqd, current->io_context);
> + if (!cic)
> + return;
> +
> + task_lock(tsk);
> + new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
> + atomic_long_inc(&tsk->io_context->refcount);
How do we know tsk has an io_context? Use get_io_context() and check
its result?
> + task_unlock(tsk);
> + if (!new_cic)
> + goto out_dec;
> +
> + spin_lock_irq(q->queue_lock);
> +
> + cfqq = cic_to_cfqq(cic, 1);
> + if (!cfqq)
> + goto out_unlock;
> +
> + /*
> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
> + * are idling on the last queue in that workload, *and* there are no
> + * potential dependent readers running currently, then go ahead and
> + * yield the queue.
> + */
Comment explains the code, but doesn't explain the *reason* for the code.
> + if (cfqd->active_queue == cfqq &&
> + cfqd->serving_type == SYNC_NOIDLE_WORKLOAD) {
> + /*
> + * If there's been no I/O from another process in the idle
> + * slice time, then there is by definition no dependent
> + * read going on for this service tree.
> + */
> + if (expiry_data_valid(cfqq->service_tree) &&
> + time_before(cfqq->service_tree->last_expiry +
> + cfq_slice_idle, jiffies) &&
> + cfqq->service_tree->last_pid != cfqq->pid)
> + goto out_unlock;
> + }
> +
> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
> + cfqq->yield_to = new_cic;
> + cfq_mark_cfqq_yield(cfqq);
> +
> +out_unlock:
> + spin_unlock_irq(q->queue_lock);
> +out_dec:
> + put_io_context(tsk->io_context);
> +}
> +
> static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
> {
> int dispatched = 0;
>
> ...
>
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -866,6 +866,14 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
> }
> }
>
> +void elv_yield(struct request_queue *q, struct task_struct *tsk)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->ops->elevator_yield_fn)
> + e->ops->elevator_yield_fn(q, tsk);
> +}
Again, no documentation. How are other programmers to know when, why
and how they should use this?
>
> ...
>
On Tue, Jun 22, 2010 at 05:34:59PM -0400, Jeff Moyer wrote:
> Hi,
>
> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
> far worse than that of deadline for enterprise class storage when dealing
> with file sizes of 8MB or less. I used the following command line as a
> representative test case:
>
> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
>
> When run using the deadline I/O scheduler, an average of the first 5 numbers
> will give you 448.4 files / second. CFQ will yield only 106.7. With
> this patch series applied (and the two patches I sent yesterday), CFQ now
> achieves 462.5 files / second.
>
> This patch set is still an RFC. I'd like to make it perform better when
> there is a competing sequential reader present. For now, I've addressed
> the concerns voiced about the previous posting.
What happened to the initial idea of just using the BIO_RW_META flag
for log writes? In the end log writes are the most important writes you
have in a journaled filesystem, and they should not be subject to any
kind of queue idling logic or other interruption. Log I/O is usually
very little (unless you use old XFS code with a worst-case directory
manipulation workload), and very latency sensitive.
Hi Jeff,
On 06/23/2010 05:34 AM, Jeff Moyer wrote:
> Hi,
>
> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
> far worse than that of deadline for enterprise class storage when dealing
> with file sizes of 8MB or less. I used the following command line as a
> representative test case:
>
> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
>
> When run using the deadline I/O scheduler, an average of the first 5 numbers
> will give you 448.4 files / second. CFQ will yield only 106.7. With
> this patch series applied (and the two patches I sent yesterday), CFQ now
> achieves 462.5 files / second.
which 2 patches? Could you paste the link or the subject? Just want to
make my test env like yours. ;)
As Joel mentioned in another mail, ocfs2 also use jbd/jbd2, so I'd like
to give it a try and give you some feedback about the test.
Regards,
Tao
>
> This patch set is still an RFC. I'd like to make it perform better when
> there is a competing sequential reader present. For now, I've addressed
> the concerns voiced about the previous posting.
>
> Review and testing would be greatly appreciated.
>
> Thanks!
> Jeff
>
> ---
>
> New from the last round:
>
> - removed the think time calculation I added for the sync-noidle service tree
> - replaced above with a suggestion from Vivek to only guard against currently
> active sequential readers when determining if we can preempt the sync-noidle
> service tree.
> - bug fixes
>
> Overall, I think it's simpler now thanks to the suggestions from Jens and
> Vivek.
>
> [PATCH 1/3] block: Implement a blk_yield function to voluntarily give up the I/O scheduler.
> [PATCH 2/3] jbd: yield the device queue when waiting for commits
> [PATCH 3/3] jbd2: yield the device queue when waiting for journal commits
Christoph Hellwig <[email protected]> writes:
> On Tue, Jun 22, 2010 at 05:34:59PM -0400, Jeff Moyer wrote:
>> Hi,
>>
>> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
>> far worse than that of deadline for enterprise class storage when dealing
>> with file sizes of 8MB or less. I used the following command line as a
>> representative test case:
>>
>> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
>>
>> When run using the deadline I/O scheduler, an average of the first 5 numbers
>> will give you 448.4 files / second. CFQ will yield only 106.7. With
>> this patch series applied (and the two patches I sent yesterday), CFQ now
>> achieves 462.5 files / second.
>>
>> This patch set is still an RFC. I'd like to make it perform better when
>> there is a competing sequential reader present. For now, I've addressed
>> the concerns voiced about the previous posting.
>
> What happened to the initial idea of just using the BIO_RW_META flag
> for log writes? In the end log writes are the most important writes you
> have in a journaled filesystem, and they should not be subject to any
> kind of queue idling logic or other interruption. Log I/O is usually
> very little (unless you use old XFS code with a worst-case directory
> manipulation workload), and very latency sensitive.
Vivek showed that starting firefox in the presence of a process doing
fsyncs (using the RQ_META approach) took twice as long as without the
patch:
http://lkml.org/lkml/2010/4/6/276
Cheers,
Jeff
Tao Ma <[email protected]> writes:
> Hi Jeff,
>
> On 06/23/2010 05:34 AM, Jeff Moyer wrote:
>> Hi,
>>
>> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
>> far worse than that of deadline for enterprise class storage when dealing
>> with file sizes of 8MB or less. I used the following command line as a
>> representative test case:
>>
>> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096 -F
>>
>> When run using the deadline I/O scheduler, an average of the first 5 numbers
>> will give you 448.4 files / second. CFQ will yield only 106.7. With
>> this patch series applied (and the two patches I sent yesterday), CFQ now
>> achieves 462.5 files / second.
> which 2 patches? Could you paste the link or the subject? Just want to
> make my test env like yours. ;)
> As Joel mentioned in another mail, ocfs2 also use jbd/jbd2, so I'd
> like to give it a try and give you some feedback about the test.
http://lkml.org/lkml/2010/6/21/307:
[PATCH 1/2] cfq: always return false from should_idle if slice_idle is
set to zero
[PATCH 2/2] cfq: allow dispatching of both sync and async I/O together
Thanks in advance for the testing!
-Jeff
Andrew Morton <[email protected]> writes:
> On Tue, 22 Jun 2010 17:35:00 -0400 Jeff Moyer <[email protected]> wrote:
>
>> This patch implements a blk_yield to allow a process to voluntarily
>> give up its I/O scheduler time slice. This is desirable for those processes
>> which know that they will be blocked on I/O from another process, such as
>> the file system journal thread. Following patches will put calls to blk_yield
>> into jbd and jbd2.
>>
>
> I'm looking through this patch series trying to find the
> analysis/description of the cause for this (bad) performance problem.
> But all I'm seeing is implementation stuff :( It's hard to review code
> with your eyes shut.
Sorry about that, Andrew. The problem case is (for example) iozone when
run with small file sizes (up to 8MB) configured to fsync after each
file is written. Because the iozone process is issuing synchronous
writes, it is put onto CFQ's SYNC service tree. The significance of
this is that CFQ will idle for up to 8ms waiting for requests on such
queues. So, what happens is that the iozone process will issue, say,
64KB worth of write I/O. That I/O will just land in the page cache.
Then the iozone process does an fsync, which forces those I/Os out to disk
as synchronous writes. The file system's fsync method is then invoked,
and for ext3/4 it calls log_start_commit followed by log_wait_commit.
Because those synchronous writes were forced out in the context of the
iozone process, CFQ will now idle on iozone's cfqq waiting for more I/O.
However, iozone's progress is now gated by the journal thread.
So, I tried two approaches to solving the problem. The first, which
Christoph brought up again in this thread, was to simply mark all
journal I/O as BIO_RW_META, which would cause the iozone process' cfqq
to be preempted when the journal issued its I/O. However, Vivek pointed
out that this was bad for interactive performance.
The second approach, of which this is the fourth iteration, was to allow
the file system to explicitly tell the I/O scheduler that it is waiting
on I/O from another process.
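Concretely, that just means the wait loop hands its time slice to the
journal thread before sleeping, as in the jbd hunk from patch 2/3 (a sketch
of log_wait_commit() with the patch applied, not the complete function):

	while (tid_gt(tid, journal->j_commit_sequence)) {
		/* We are about to block on kjournald, so give it our slice. */
		blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
		wake_up(&journal->j_wait_commit);
		spin_unlock(&journal->j_state_lock);
		wait_event(journal->j_wait_done_commit,
			   !tid_gt(tid, journal->j_commit_sequence));
		spin_lock(&journal->j_state_lock);
	}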
Does that help? Let me know if you have any more questions, and thanks
a ton for looking at this, Andrew. I appreciate it.
The comments I've elided from my response make perfect sense, so I'll
address them in the next posting.
>> };
>> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
>> - .count = 0, .min_vdisktime = 0, }
>> + .count = 0, .min_vdisktime = 0, .last_expiry = 0UL, \
>> + .last_pid = (pid_t)-1, }
>
> May as well leave the 0 and NULL fields unmentioned (ie: don't do
> crappy stuff because the old code did crappy stuff!)
I don't actually understand why you take issue with this.
>> +{
>> + struct cfq_data *cfqd = q->elevator->elevator_data;
>> + struct cfq_io_context *cic, *new_cic;
>> + struct cfq_queue *cfqq;
>> +
>> + cic = cfq_cic_lookup(cfqd, current->io_context);
>> + if (!cic)
>> + return;
>> +
>> + task_lock(tsk);
>> + new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
>> + atomic_long_inc(&tsk->io_context->refcount);
>
> How do we know tsk has an io_context? Use get_io_context() and check
> its result?
I'll fix that up. It works now only by luck (and the fact that there's
a good chance the journal thread has an i/o context).
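One way to make that safe (just a sketch, not necessarily what the next
posting will do) is to take the reference only if the task actually has an
io_context, and bail out otherwise:

	struct io_context *ioc = NULL;

	task_lock(tsk);
	if (tsk->io_context) {
		ioc = tsk->io_context;
		atomic_long_inc(&ioc->refcount);
	}
	task_unlock(tsk);
	if (!ioc)
		return;

	new_cic = cfq_cic_lookup(cfqd, ioc);
	if (!new_cic)
		goto out_dec;	/* out_dec would then do put_io_context(ioc) */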
>> + task_unlock(tsk);
>> + if (!new_cic)
>> + goto out_dec;
>> +
>> + spin_lock_irq(q->queue_lock);
>> +
>> + cfqq = cic_to_cfqq(cic, 1);
>> + if (!cfqq)
>> + goto out_unlock;
>> +
>> + /*
>> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
>> + * are idling on the last queue in that workload, *and* there are no
>> + * potential dependent readers running currently, then go ahead and
>> + * yield the queue.
>> + */
>
> Comment explains the code, but doesn't explain the *reason* for the code.
Actually, it explains more than just what the code does. It would be
difficult for one to divine that the code actually only really cares
about breaking up a currently running dependent reader. I'll see if I
can make that more clear.
Cheers,
Jeff
On Tue, Jun 22, 2010 at 05:35:00PM -0400, Jeff Moyer wrote:
[..]
> @@ -1614,6 +1620,15 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> cfq_clear_cfqq_wait_request(cfqq);
> cfq_clear_cfqq_wait_busy(cfqq);
>
> + if (!cfq_cfqq_yield(cfqq)) {
> + struct cfq_rb_root *st;
> + st = service_tree_for(cfqq->cfqg,
> + cfqq_prio(cfqq), cfqq_type(cfqq));
> + st->last_expiry = jiffies;
> + st->last_pid = cfqq->pid;
> + }
> + cfq_clear_cfqq_yield(cfqq);
Jeff, I think cfqq is still on service tree at this point of time. If yes,
then we can simply use cfqq->service_tree, instead of calling
service_tree_for().
No clearing of cfqq->yield_to field?
[..]
> /*
> * Select a queue for service. If we have a current active queue,
> * check whether to continue servicing it, or retrieve and set a new one.
> @@ -2187,6 +2232,10 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> * have been idling all along on this queue and it should be
> * ok to wait for this request to complete.
> */
> + if (cfq_cfqq_yield(cfqq) &&
> + cfq_should_yield_now(cfqq, &new_cfqq))
> + goto expire;
> +
I think we can get rid of this condition here and move the yield check
outside of the if condition above. That condition waits for a request from
this queue to complete and for the queue to get busy again before slice
expiry. If we have decided to yield the queue, there is no point in
waiting for the next request to make the queue busy.
> if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
> && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
> cfqq = NULL;
> @@ -2215,6 +2264,9 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> goto expire;
> }
>
> + if (cfq_cfqq_yield(cfqq) && cfq_should_yield_now(cfqq, &new_cfqq))
> + goto expire;
> +
We can move this check up.
[..]
> +static void cfq_yield(struct request_queue *q, struct task_struct *tsk)
> +{
> + struct cfq_data *cfqd = q->elevator->elevator_data;
> + struct cfq_io_context *cic, *new_cic;
> + struct cfq_queue *cfqq;
> +
> + cic = cfq_cic_lookup(cfqd, current->io_context);
> + if (!cic)
> + return;
> +
> + task_lock(tsk);
> + new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
> + atomic_long_inc(&tsk->io_context->refcount);
> + task_unlock(tsk);
> + if (!new_cic)
> + goto out_dec;
> +
> + spin_lock_irq(q->queue_lock);
> +
> + cfqq = cic_to_cfqq(cic, 1);
> + if (!cfqq)
> + goto out_unlock;
> +
> + /*
> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
> + * are idling on the last queue in that workload, *and* there are no
> + * potential dependent readers running currently, then go ahead and
> + * yield the queue.
> + */
> + if (cfqd->active_queue == cfqq &&
> + cfqd->serving_type == SYNC_NOIDLE_WORKLOAD) {
> + /*
> + * If there's been no I/O from another process in the idle
> + * slice time, then there is by definition no dependent
> + * read going on for this service tree.
> + */
> + if (expiry_data_valid(cfqq->service_tree) &&
> + time_before(cfqq->service_tree->last_expiry +
> + cfq_slice_idle, jiffies) &&
> + cfqq->service_tree->last_pid != cfqq->pid)
> + goto out_unlock;
> + }
> +
> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
> + cfqq->yield_to = new_cic;
We are stashing away a pointer to cic without taking reference?
> + cfq_mark_cfqq_yield(cfqq);
> +
> +out_unlock:
> + spin_unlock_irq(q->queue_lock);
> +out_dec:
> + put_io_context(tsk->io_context);
> +}
> +
> static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
> {
> int dispatched = 0;
> @@ -3123,6 +3234,13 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> if (!cfqq)
> return false;
>
> + /*
> + * If the active queue yielded its timeslice to this queue, let
> + * it preempt.
> + */
> + if (cfq_cfqq_yield(cfqq) && RQ_CIC(rq) == cfqq->yield_to)
> + return true;
> +
I think we need to again check whether we are in the sync-noidle workload
and then allow preemption only if no dependent read is currently going on,
otherwise the sync-noidle service tree loses its share.
This version looks much simpler than the previous one and is much easier
to understand. I will do some testing on Friday and give you feedback.
Thanks
Vivek
Hi Jeff,
On 06/23/2010 05:30 PM, Tao Ma wrote:
> Hi Jeff,
>
> On 06/23/2010 05:34 AM, Jeff Moyer wrote:
>> Hi,
>>
>> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
>> far worse than that of deadline for enterprise class storage when dealing
>> with file sizes of 8MB or less. I used the following command line as a
>> representative test case:
>>
>> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w
>> 4096 -F
>>
>> When run using the deadline I/O scheduler, an average of the first 5
>> numbers
>> will give you 448.4 files / second. CFQ will yield only 106.7. With
>> this patch series applied (and the two patches I sent yesterday), CFQ now
>> achieves 462.5 files / second.
> which 2 patches? Could you paste the link or the subject? Just want to
> make my test env like yours. ;)
> As Joel mentioned in another mail, ocfs2 also use jbd/jbd2, so I'd like
> to give it a try and give you some feedback about the test.
I am sorry to say that the patch makes jbd2 lock up when I tested
fs_mark using ocfs2.
I have attached the log from my netconsole server. After I reverted the
patch [3/3], the box works again.
Regards,
Tao
Tao Ma <[email protected]> writes:
> Hi Jeff,
>
[...]
> I am sorry to say that the patch makes jbd2 lock up when I tested
> fs_mark using ocfs2.
> I have attached the log from my netconsole server. After I reverted
> the patch [3/3], the box works again.
Thanks for the report, Tao, I'll try to reproduce it here and get back
to you.
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> On Tue, Jun 22, 2010 at 05:35:00PM -0400, Jeff Moyer wrote:
>
> [..]
>> @@ -1614,6 +1620,15 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> cfq_clear_cfqq_wait_request(cfqq);
>> cfq_clear_cfqq_wait_busy(cfqq);
>>
>> + if (!cfq_cfqq_yield(cfqq)) {
>> + struct cfq_rb_root *st;
>> + st = service_tree_for(cfqq->cfqg,
>> + cfqq_prio(cfqq), cfqq_type(cfqq));
>> + st->last_expiry = jiffies;
>> + st->last_pid = cfqq->pid;
>> + }
>> + cfq_clear_cfqq_yield(cfqq);
>
> Jeff, I think cfqq is still on service tree at this point of time. If yes,
> then we can simply use cfqq->service_tree, instead of calling
> service_tree_for().
Yup.
> No clearing of cfqq->yield_to field?
Nope. Again, it's not required, but if you really want me to, I'll add
it.
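In other words, the expiry path would end up looking something like this
(a sketch only, assuming cfqq really is still on its service tree here, as
you suggest):

	if (!cfq_cfqq_yield(cfqq)) {
		cfqq->service_tree->last_expiry = jiffies;
		cfqq->service_tree->last_pid = cfqq->pid;
	}
	cfq_clear_cfqq_yield(cfqq);
	cfqq->yield_to = NULL;	/* not strictly needed, but unambiguous */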
> [..]
>> /*
>> * Select a queue for service. If we have a current active queue,
>> * check whether to continue servicing it, or retrieve and set a new one.
>> @@ -2187,6 +2232,10 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>> * have been idling all along on this queue and it should be
>> * ok to wait for this request to complete.
>> */
>> + if (cfq_cfqq_yield(cfqq) &&
>> + cfq_should_yield_now(cfqq, &new_cfqq))
>> + goto expire;
>> +
>
> I think we can get rid of this condition here and move the yield check
> outside of the if condition above. That condition waits for a request from
> this queue to complete and for the queue to get busy again before slice
> expiry. If we have decided to yield the queue, there is no point in
> waiting for the next request to make the queue busy.
Yeah, this is a vestige of the older code layout. Thanks, this cleans
things up nicely.
>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
>> + cfqq->yield_to = new_cic;
>
> We are stashing away a pointer to cic without taking reference?
There is no reference counting on the cic.
>> @@ -3123,6 +3234,13 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>> if (!cfqq)
>> return false;
>>
>> + /*
>> + * If the active queue yielded its timeslice to this queue, let
>> + * it preempt.
>> + */
>> + if (cfq_cfqq_yield(cfqq) && RQ_CIC(rq) == cfqq->yield_to)
>> + return true;
>> +
>
> I think we need to again check whether we are in the sync-noidle workload
> and then allow preemption only if no dependent read is currently going on,
> otherwise the sync-noidle service tree loses its share.
I think you mean don't yield if there is a dependent reader. Yeah,
makes sense.
> This version looks much simpler than the previous one and is much easier
> to understand. I will do some testing on Friday and give you feedback.
Great, thanks again for the review!
Cheers,
Jeff
On 25/06/10 18.51, Jeff Moyer wrote:
>>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
>>> + cfqq->yield_to = new_cic;
>>
>> We are stashing away a pointer to cic without taking reference?
>
> There is no reference counting on the cic.
Not on the cic itself, but on the io context it belongs to. So you
need to grab a reference to that, if you are stowing a reference
to the cic.
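E.g. something like this (a sketch only, assuming the cic's ->ioc
backpointer; the reference would of course have to be dropped again when
yield_to is cleared):

	/* pin the io_context the cic belongs to while the cic is stashed */
	atomic_long_inc(&new_cic->ioc->refcount);
	cfqq->yield_to = new_cic;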
--
Jens Axboe
Jens Axboe <[email protected]> writes:
> On 25/06/10 18.51, Jeff Moyer wrote:
>>>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
>>>> + cfqq->yield_to = new_cic;
>>>
>>> We are stashing away a pointer to cic without taking reference?
>>
>> There is no reference counting on the cic.
>
> Not on the cic itself, but on the io context it belongs to. So you
> need to grab a reference to that, if you are stowing a reference
> to the cic.
OK, easy enough. Thanks!
Jeff
On Fri, Jun 25, 2010 at 12:51:58PM -0400, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Tue, Jun 22, 2010 at 05:35:00PM -0400, Jeff Moyer wrote:
> >
> > [..]
> >> @@ -1614,6 +1620,15 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >> cfq_clear_cfqq_wait_request(cfqq);
> >> cfq_clear_cfqq_wait_busy(cfqq);
> >>
> >> + if (!cfq_cfqq_yield(cfqq)) {
> >> + struct cfq_rb_root *st;
> >> + st = service_tree_for(cfqq->cfqg,
> >> + cfqq_prio(cfqq), cfqq_type(cfqq));
> >> + st->last_expiry = jiffies;
> >> + st->last_pid = cfqq->pid;
> >> + }
> >> + cfq_clear_cfqq_yield(cfqq);
> >
> > Jeff, I think cfqq is still on service tree at this point of time. If yes,
> > then we can simply use cfqq->service_tree, instead of calling
> > service_tree_for().
>
> Yup.
>
> > No clearing of cfqq->yield_to field?
>
> Nope. Again, it's not required, but if you really want me to, I'll add
> it.
I think clearing it is better as it leaves no scope for confusion.
Thanks
Vivek
Tao Ma <[email protected]> writes:
> Hi Jeff,
>
> On 06/23/2010 05:30 PM, Tao Ma wrote:
>> Hi Jeff,
>>
>> On 06/23/2010 05:34 AM, Jeff Moyer wrote:
>>> Hi,
>>>
>>> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
>>> far worse than that of deadline for enterprise class storage when dealing
>>> with file sizes of 8MB or less. I used the following command line as a
>>> representative test case:
>>>
>>> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w
>>> 4096 -F
>>>
>>> When run using the deadline I/O scheduler, an average of the first 5
>>> numbers
>>> will give you 448.4 files / second. CFQ will yield only 106.7. With
>>> this patch series applied (and the two patches I sent yesterday), CFQ now
>>> achieves 462.5 files / second.
>> which 2 patches? Could you paste the link or the subject? Just want to
>> make my test env like yours. ;)
>> As Joel mentioned in another mail, ocfs2 also use jbd/jbd2, so I'd like
>> to give it a try and give you some feedback about the test.
> I am sorry to say that the patch makes jbd2 lock up when I tested
> fs_mark using ocfs2.
> I have attached the log from my netconsole server. After I reverted
> the patch [3/3], the box works again.
I can't reproduce this, unfortunately. Also, when building with the
.config you sent me, the disassembly doesn't line up with the stack
trace you posted.
I'm not sure why yielding the queue would cause a deadlock. The only
explanation I can come up with is that I/O is not being issued. I'm
assuming that no other I/O will be completed to the file system in
question. Is that right? Could you send along the output from sysrq-t?
Cheers,
Jeff
Hi Jeff,
On 06/27/2010 09:48 PM, Jeff Moyer wrote:
> Tao Ma<[email protected]> writes:
>
>> Hi Jeff,
>>
>> On 06/23/2010 05:30 PM, Tao Ma wrote:
>>> Hi Jeff,
>>>
>>> On 06/23/2010 05:34 AM, Jeff Moyer wrote:
>>>> Hi,
>>>>
>>>> Running iozone with the fsync flag, or fs_mark, the performance of CFQ is
>>>> far worse than that of deadline for enterprise class storage when dealing
>>>> with file sizes of 8MB or less. I used the following command line as a
>>>> representative test case:
>>>>
>>>> fs_mark -S 1 -D 10000 -N 100000 -d /mnt/test/fs_mark -s 65536 -t 1 -w
>>>> 4096 -F
>>>>
>>>> When run using the deadline I/O scheduler, an average of the first 5
>>>> numbers
>>>> will give you 448.4 files / second. CFQ will yield only 106.7. With
>>>> this patch series applied (and the two patches I sent yesterday), CFQ now
>>>> achieves 462.5 files / second.
>>> which 2 patches? Could you paste the link or the subject? Just want to
>>> make my test env like yours. ;)
>>> As Joel mentioned in another mail, ocfs2 also use jbd/jbd2, so I'd like
>>> to give it a try and give you some feedback about the test.
>> I am sorry to say that the patch makes jbd2 lock up when I tested
>> fs_mark using ocfs2.
>> I have attached the log from my netconsole server. After I reverted
>> the patch [3/3], the box works again.
>
> I can't reproduce this, unfortunately. Also, when building with the
> .config you sent me, the disassembly doesn't line up with the stack
> trace you posted.
>
> I'm not sure why yielding the queue would cause a deadlock. The only
> explanation I can come up with is that I/O is not being issued. I'm
> assuming that no other I/O will be completed to the file system in
> question. Is that right? Could you send along the output from sysrq-t?
yes, I just mounted it and began the test, so there should be no
outstanding I/O. So do you need me to set up another disk for the test?
I have attached the sysrq output in sysrq.log. please check.
btw, I also met with a NULL pointer dereference in cfq_yield. I have
attached the null.log also. This seems to be related to the previous
deadlock and happens when I try to remount the same volume after reboot
and ocfs2 try to do some recovery.
Regards,
Tao
Tao Ma <[email protected]> writes:
> btw, I also met with a NULL pointer dereference in cfq_yield. I have
> attached the null.log also. This seems to be related to the previous
> deadlock and happens when I try to remount the same volume after
> reboot and ocfs2 try to do some recovery.
Since I can't reproduce your kernel binary even with your .config, could
you send me the disassembly of the cfq_yield function from your vmlinux
binary?
Thanks!
Jeff
ffffffff821614d8 <cfq_yield>:
ffffffff821614d8: 41 56 push %r14
ffffffff821614da: 41 55 push %r13
ffffffff821614dc: 49 89 fd mov %rdi,%r13
ffffffff821614df: 41 54 push %r12
ffffffff821614e1: 55 push %rbp
ffffffff821614e2: 48 89 f5 mov %rsi,%rbp
ffffffff821614e5: 53 push %rbx
ffffffff821614e6: 48 8b 47 18 mov 0x18(%rdi),%rax
ffffffff821614ea: 4c 8b 60 08 mov 0x8(%rax),%r12
ffffffff821614ee: 65 48 8b 04 25 40 b5 mov %gs:0xb540,%rax
ffffffff821614f5: 00 00
ffffffff821614f7: 48 8b b0 00 06 00 00 mov 0x600(%rax),%rsi
ffffffff821614fe: 4c 89 e7 mov %r12,%rdi
ffffffff82161501: e8 90 eb ff ff callq ffffffff82160096 <cfq_cic_lookup>
ffffffff82161506: 48 85 c0 test %rax,%rax
ffffffff82161509: 48 89 c3 mov %rax,%rbx
ffffffff8216150c: 0f 84 f2 00 00 00 je ffffffff82161604 <cfq_yield+0x12c>
ffffffff82161512: 48 8d bd a0 05 00 00 lea 0x5a0(%rbp),%rdi
ffffffff82161519: e8 ab ba 19 00 callq ffffffff822fcfc9 <_raw_spin_lock>
ffffffff8216151e: 48 8b b5 00 06 00 00 mov 0x600(%rbp),%rsi
ffffffff82161525: 4c 89 e7 mov %r12,%rdi
ffffffff82161528: e8 69 eb ff ff callq ffffffff82160096 <cfq_cic_lookup>
ffffffff8216152d: 49 89 c6 mov %rax,%r14
ffffffff82161530: 48 8b 85 00 06 00 00 mov 0x600(%rbp),%rax
ffffffff82161537: f0 48 ff 00 lock incq (%rax)
ffffffff8216153b: fe 85 a0 05 00 00 incb 0x5a0(%rbp)
ffffffff82161541: 4d 85 f6 test %r14,%r14
ffffffff82161544: 0f 84 a6 00 00 00 je ffffffff821615f0 <cfq_yield+0x118>
ffffffff8216154a: 49 8b bd 48 03 00 00 mov 0x348(%r13),%rdi
ffffffff82161551: e8 a0 ba 19 00 callq ffffffff822fcff6 <_raw_spin_lock_irq>
ffffffff82161556: 48 8b 5b 10 mov 0x10(%rbx),%rbx
ffffffff8216155a: 48 85 db test %rbx,%rbx
ffffffff8216155d: 0f 84 83 00 00 00 je ffffffff821615e6 <cfq_yield+0x10e>
ffffffff82161563: 49 39 9c 24 68 03 00 cmp %rbx,0x368(%r12)
ffffffff8216156a: 00
ffffffff8216156b: 75 41 jne ffffffff821615ae <cfq_yield+0xd6>
ffffffff8216156d: 41 83 bc 24 54 02 00 cmpl $0x1,0x254(%r12)
ffffffff82161574: 00 01
ffffffff82161576: 75 36 jne ffffffff821615ae <cfq_yield+0xd6>
ffffffff82161578: 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax
ffffffff8216157f: 8b 70 38 mov 0x38(%rax),%esi
ffffffff82161582: 83 fe ff cmp $0xffffffffffffffff,%esi
ffffffff82161585: 74 27 je ffffffff821615ae <cfq_yield+0xd6>
ffffffff82161587: 48 8b 48 30 mov 0x30(%rax),%rcx
ffffffff8216158b: 48 85 c9 test %rcx,%rcx
ffffffff8216158e: 74 1e je ffffffff821615ae <cfq_yield+0xd6>
ffffffff82161590: 48 63 05 4d f2 5d 00 movslq 6156877(%rip),%rax # ffffffff827407e4 <cfq_slice_idle>
ffffffff82161597: 48 8b 15 e2 42 63 00 mov 6505186(%rip),%rdx # ffffffff82795880 <jiffies>
ffffffff8216159e: 48 01 c8 add %rcx,%rax
ffffffff821615a1: 48 39 d0 cmp %rdx,%rax
ffffffff821615a4: 79 08 jns ffffffff821615ae <cfq_yield+0xd6>
ffffffff821615a6: 3b b3 c0 00 00 00 cmp 0xc0(%rbx),%esi
ffffffff821615ac: 75 38 jne ffffffff821615e6 <cfq_yield+0x10e>
ffffffff821615ae: 49 8b 04 24 mov (%r12),%rax
ffffffff821615b2: 48 8b b8 80 04 00 00 mov 0x480(%rax),%rdi
ffffffff821615b9: 48 85 ff test %rdi,%rdi
ffffffff821615bc: 74 1a je ffffffff821615d8 <cfq_yield+0x100>
ffffffff821615be: 8b 8d 68 02 00 00 mov 0x268(%rbp),%ecx
ffffffff821615c4: 8b 93 c0 00 00 00 mov 0xc0(%rbx),%edx
ffffffff821615ca: 48 c7 c6 1d 8c 55 82 mov $0xffffffff82558c1d,%rsi
ffffffff821615d1: 31 c0 xor %eax,%eax
ffffffff821615d3: e8 a1 07 f2 ff callq ffffffff82081d79 <__trace_note_message>
ffffffff821615d8: 81 4b 04 00 20 00 00 orl $0x2000,0x4(%rbx)
ffffffff821615df: 4c 89 b3 f0 00 00 00 mov %r14,0xf0(%rbx)
ffffffff821615e6: 49 8b 85 48 03 00 00 mov 0x348(%r13),%rax
ffffffff821615ed: fe 00 incb (%rax)
ffffffff821615ef: fb sti
ffffffff821615f0: 5b pop %rbx
ffffffff821615f1: 48 8b bd 00 06 00 00 mov 0x600(%rbp),%rdi
ffffffff821615f8: 5d pop %rbp
ffffffff821615f9: 41 5c pop %r12
ffffffff821615fb: 41 5d pop %r13
ffffffff821615fd: 41 5e pop %r14
ffffffff821615ff: e9 51 8b ff ff jmpq ffffffff8215a155 <put_io_context>
ffffffff82161604: 5b pop %rbx
ffffffff82161605: 5d pop %rbp
ffffffff82161606: 41 5c pop %r12
ffffffff82161608: 41 5d pop %r13
ffffffff8216160a: 41 5e pop %r14
ffffffff8216160c: c3 retq
Tao Ma <[email protected]> writes:
> Hi Jeff,
>
> On 06/27/2010 09:48 PM, Jeff Moyer wrote:
>> Tao Ma<[email protected]> writes:
>>> I am sorry to say that the patch makes jbd2 lock up when I tested
>>> fs_mark using ocfs2.
>>> I have attached the log from my netconsole server. After I reverted
>>> the patch [3/3], the box works again.
>>
>> I can't reproduce this, unfortunately. Also, when building with the
>> .config you sent me, the disassembly doesn't line up with the stack
>> trace you posted.
>>
>> I'm not sure why yielding the queue would cause a deadlock. The only
>> explanation I can come up with is that I/O is not being issued. I'm
>> assuming that no other I/O will be completed to the file system in
>> question. Is that right? Could you send along the output from sysrq-t?
> yes, I just mounted it and began the test, so there should be no
> outstanding I/O. So do you need me to set up another disk for the test?
> I have attached the sysrq output in sysrq.log. please check.
Well, if it doesn't take long to reproduce, then it might be helpful to
see a blktrace of the run. However, it might also just be worth waiting
for the next version of the patch to see if that fixes your issue.
> btw, I also met with a NULL pointer dereference in cfq_yield. I have
> attached the null.log also. This seems to be related to the previous
> deadlock and happens when I try to remount the same volume after
> reboot and ocfs2 try to do some recovery.
Pid: 4130, comm: ocfs2_wq Not tainted 2.6.35-rc3+ #5 0MM599/OptiPlex 745
RIP: 0010:[<ffffffff82161537>]
[<ffffffff82161537>] cfq_yield+0x5f/0x135
RSP: 0018:ffff880123061c60 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88012c2b5ea8 RCX: ffff88012c3a30d0
ffffffff82161528: e8 69 eb ff ff callq ffffffff82160096 <cfq_cic_lookup>
ffffffff8216152d: 49 89 c6 mov %rax,%r14
ffffffff82161530: 48 8b 85 00 06 00 00 mov 0x600(%rbp),%rax
ffffffff82161537: f0 48 ff 00 lock incq (%rax)
I'm pretty sure that's a NULL pointer deref of the tsk->io_context that
was passed into the yield function. I've since fixed that, so your
recovery code should be safe in the newest version (which I've not yet
posted).
Cheers,
Jeff
Hi Jeff,
On 06/29/2010 10:56 PM, Jeff Moyer wrote:
> Tao Ma<[email protected]> writes:
>
>> Hi Jeff,
>>
>> On 06/27/2010 09:48 PM, Jeff Moyer wrote:
>>> Tao Ma<[email protected]> writes:
>>>> I am sorry to say that the patch make jbd2 locked up when I tested
>>>> fs_mark using ocfs2.
>>>> I have attached the log from my netconsole server. After I reverted
>>>> the patch [3/3], the box works again.
>>>
>>> I can't reproduce this, unfortunately. Also, when building with the
>>> .config you sent me, the disassembly doesn't line up with the stack
>>> trace you posted.
>>>
>>> I'm not sure why yielding the queue would cause a deadlock. The only
>>> explanation I can come up with is that I/O is not being issued. I'm
>>> assuming that no other I/O will be completed to the file system in
>>> question. Is that right? Could you send along the output from sysrq-t?
>> yes, I just mounted it and began the test, so there should be no
>> outstanding I/O. So do you need me to set up another disk for the test?
>> I have attached the sysrq output in sysrq.log. please check.
>
> Well, if it doesn't take long to reproduce, then it might be helpful to
> see a blktrace of the run. However, it might also just be worth waiting
> for the next version of the patch to see if that fixes your issue.
>
>> btw, I also met with a NULL pointer dereference in cfq_yield. I have
>> attached the null.log also. This seems to be related to the previous
>> deadlock and happens when I try to remount the same volume after
>> reboot and ocfs2 try to do some recovery.
>
> Pid: 4130, comm: ocfs2_wq Not tainted 2.6.35-rc3+ #5 0MM599/OptiPlex 745
> RIP: 0010:[<ffffffff82161537>]
> [<ffffffff82161537>] cfq_yield+0x5f/0x135
> RSP: 0018:ffff880123061c60 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88012c2b5ea8 RCX: ffff88012c3a30d0
>
> ffffffff82161528: e8 69 eb ff ff callq ffffffff82160096<cfq_cic_lookup>
> ffffffff8216152d: 49 89 c6 mov %rax,%r14
> ffffffff82161530: 48 8b 85 00 06 00 00 mov 0x600(%rbp),%rax
> ffffffff82161537: f0 48 ff 00 lock incq (%rax)
>
> I'm pretty sure that's a NULL pointer deref of the tsk->iocontext that
> was passed into the yield function. I've since fixed that, so your
> recovery code should be safe in the newest version (which I've not yet
> posted).
ok, so could you please cc me when the new patches are out? It would be
easier for me to track it. Thanks.
Regards,
Tao