2011-06-27 20:19:30

by Vivek Goyal

Subject: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

Hi,

Konstantin reported that fsync is very slow with ext4 if the fsyncing process
is in a separate cgroup and one is using the CFQ IO scheduler.

https://lkml.org/lkml/2011/6/23/269

The issue seems to be that the fsync process is in a separate cgroup while the
journalling thread is in the root cgroup. After every IO from fsync, CFQ idles
on the fsync process's queue waiting for more requests to come. But the fsync
process is now waiting for IO to finish from the journalling thread. After
waiting for 8ms, fsync's queue gives way to jbd's queue. Then we start idling
on the jbd thread, while new IO from fsync sits in a separate queue in a
separate group.

Bottom line, after every IO we end up idling on the fsync and jbd threads so
much that if somebody is doing an fsync after every 4K of IO, throughput
nose-dives.

A similar issue had come up within the same cgroup when the "fsync" and "jbd"
threads were being queued on different service trees and idling was killing
throughput. At that time two solutions were proposed: one from Jeff Moyer
and one from Corrado Zoccolo.

Jeff came up with the idea of a block layer API to yield the queue when
explicitly told to by the file system, hence cutting down on idling.

https://lkml.org/lkml/2010/7/2/277

Corrado came up with a simpler approach of keeping the jbd and fsync processes
on the same service tree based on the RQ_NOIDLE flag. By queuing on the same
service tree, one queue can preempt the other, hence cutting down on idling
time. Upstream went ahead with the simpler approach to fix the issue.

commit 749ef9f8423054e326f3a246327ed2db4b6d395f
Author: Corrado Zoccolo <[email protected]>
Date: Mon Sep 20 15:24:50 2010 +0200

cfq: improve fsync performance for small files


Now with cgroups, the same problem resurfaces, but this time we can not queue
both processes on the same service tree and take advantage of preemption, as
separate cgroups have separate service trees and the two processes belong to
separate cgroups. We do not allow cross-cgroup preemption as that will break
down the isolation between groups.

So this patch series resurrects Jeff's solution of the file system specifying
the IO dependencies between threads explicitly to the block layer/IO scheduler.
Once the IO scheduler knows that the queue we are currently idling on is
dependent on IO from some other queue, CFQ allows dispatch of requests from
that other queue in the context of the currently active queue.

So if the fsync thread specifies a dependency on the journalling thread, then
while the fsync thread's time slice is running, CFQ allows dispatch from jbd
within that slice, hence cutting down on idling.
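
To make the intended usage concrete, here is a minimal caller-side sketch,
modelled on what patches 2/3 and 3/3 do for ext4/ext3. The wrapper name
fs_wait_on_journal_commit() is made up for illustration; the queue and task
pointers are taken from the journal as in the actual patches.

/*
 * Sketch only: bracket the wait for a journal commit with the new hints
 * so CFQ can dispatch the journal thread's IO in our time slice instead
 * of idling. Mirrors the hunks in patches 2/3 and 3/3.
 */
static int fs_wait_on_journal_commit(journal_t *journal, tid_t commit_tid)
{
	struct request_queue *q = journal->j_dev->bd_disk->queue;
	int ret;

	/* current (the fsync caller) now depends on the journal task's IO */
	blk_set_depends_on_task(q, journal->j_task);
	ret = jbd2_log_wait_commit(journal, commit_tid);
	/* the dependency is over, tear down the mapping */
	blk_reset_depends_on_task(q);

	return ret;
}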

This patch series seems to be working for me. I did testing for ext4 only.
The series is based on the for-3.1/core branch of Jens' block tree.
Konstantin, can you please give it a try and see if it fixes your
issue?

Any feedback on how to solve this issue is appreciated.

Thanks
Vivek


Vivek Goyal (3):
block: A new interface for specifying IO dependencies among tasks
ext4: Explicitly specify fsync dependency on journaling thread
ext3: Explicitly specify fsync dependency on journaling thread

block/blk-core.c | 42 ++++++++
block/cfq-iosched.c | 236 ++++++++++++++++++++++++++++++++++++++++++---
block/elevator.c | 16 +++
fs/ext3/fsync.c | 3 +
fs/ext4/fsync.c | 3 +
include/linux/blkdev.h | 8 ++-
include/linux/elevator.h | 6 +
7 files changed, 297 insertions(+), 17 deletions(-)

--
1.7.4.4


2011-06-27 20:19:14

by Vivek Goyal

Subject: [PATCH 1/3] block: A new interface for specifying IO dependencies among tasks

This patch introduces two new block layer interfaces which allow file
systems to specify IO dependencies among processes. For example, the fsync
process is dependent on IO completion from the journalling thread in ext3
and ext4.

blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
blk_reset_depends_on_task(struct request_queue *q)

The first one allows one to say that the current process is dependent on
IO completions from "tsk". This call will let CFQ know about the
dependency, and CFQ will allow "tsk"'s IO to be dispatched in the
allocated time slice of the calling process.

Once the dependency is over, one needs to tear down the mapping
by calling the reset function.

Once the dependency is set up, if CFQ decides to idle on a queue
because there are no more requests, it also checks for requests in
the dependent cfqq; if requests are available, instead of idling,
it dispatches requests from the dependent queue.
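
As a simplified illustration of that decision (the real changes are in
cfq_select_queue()/cfq_dispatch_request() in the diff below), the dispatch
path conceptually becomes the following; dispatch_own_request() and
arm_idle_timer() are just placeholders for the existing paths:

/* Sketch only: condensed view of the new dispatch decision */
if (!RB_EMPTY_ROOT(&cfqq->sort_list)) {
	/* our own queue still has requests, dispatch as usual */
	dispatch_own_request(cfqd, cfqq);
} else if (cfqq->depends_on &&
	   !RB_EMPTY_ROOT(&cfqq->depends_on->sort_list)) {
	/*
	 * No requests of our own, but the queue we depend on has some:
	 * pull one over and dispatch it in our slice instead of idling.
	 */
	cfq_set_dispatch_dependent_request(cfqd, cfqq);
} else {
	/* nothing to do anywhere, fall back to the usual idling logic */
	arm_idle_timer(cfqd, cfqq);
}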

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-core.c | 42 ++++++++
block/cfq-iosched.c | 236 ++++++++++++++++++++++++++++++++++++++++++---
block/elevator.c | 16 +++
include/linux/blkdev.h | 8 ++-
include/linux/elevator.h | 6 +
5 files changed, 291 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..c6ae665 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -398,6 +398,19 @@ static int blk_init_free_list(struct request_queue *q)
return 0;
}

+static void
+generic_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->elevator)
+ elv_set_depends_on_task(q, tsk);
+}
+
+static void generic_reset_depends_on_task(struct request_queue *q)
+{
+ if (q->elevator)
+ elv_reset_depends_on_task(q);
+}
+
struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
{
return blk_alloc_queue_node(gfp_mask, -1);
@@ -534,6 +547,8 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
q->prep_rq_fn = NULL;
q->unprep_rq_fn = NULL;
q->queue_flags = QUEUE_FLAG_DEFAULT;
+ q->set_depends_on_task_fn = generic_set_depends_on_task;
+ q->reset_depends_on_task_fn = generic_reset_depends_on_task;

/* Override internal queue lock with supplied lock pointer */
if (lock)
@@ -2766,6 +2781,33 @@ void blk_finish_plug(struct blk_plug *plug)
}
EXPORT_SYMBOL(blk_finish_plug);

+/*
+ * Give a hint to CFQ that current task is dependent on IO coming from
+ * "tsk". In such cases, CFQ will allow dispatch from "tsk" queue in
+ * the time slice of "current" process and this can cut down on
+ * unnecessarily queue idling.
+ *
+ * This function will return with interrupts disabled in case of CFQ.
+ * (task_lock()/task_unlock() pair).
+ */
+void blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->set_depends_on_task_fn)
+ q->set_depends_on_task_fn(q, tsk);
+}
+EXPORT_SYMBOL(blk_set_depends_on_task);
+
+/*
+ * Tear down any dependent task mapping previously set up by the current
+ * task.
+ */
+void blk_reset_depends_on_task(struct request_queue *q)
+{
+ if (q->reset_depends_on_task_fn)
+ q->reset_depends_on_task_fn(q);
+}
+EXPORT_SYMBOL(blk_reset_depends_on_task);
+
int __init blk_dev_init(void)
{
BUILD_BUG_ON(__REQ_NR_BITS > 8 *
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3d403a1..6ff0b4c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -75,6 +75,8 @@ static DEFINE_IDA(cic_index_ida);
#define sample_valid(samples) ((samples) > 80)
#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)

+static void cfq_put_request(struct request *rq);
+
/*
* Most of our rbtree usage is for sorting with min extraction, so
* if we cache the leftmost node we don't have to walk down the tree
@@ -148,6 +150,12 @@ struct cfq_queue {
struct cfq_group *cfqg;
/* Number of sectors dispatched from queue in single dispatch round */
unsigned long nr_sectors;
+
+ /*
+ * This cfqq's further IO is dependent on IO queued in the other
+ * cfqq pointed to by depends_on.
+ */
+ struct cfq_queue *depends_on;
};

/*
@@ -447,7 +455,7 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
}

-static void cfq_dispatch_insert(struct request_queue *, struct request *);
+static void cfq_dispatch_insert(struct request_queue *, struct request *, bool);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
@@ -2043,16 +2051,23 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)

/*
* Move request from internal lists to the request queue dispatch list.
+ * @rq_dequeued: rq to be dispatched has already been removed from associated
+ * cfqq. This is useful when rq from dependent queue is being
+ * dispatched in current queue context.
*/
-static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
+static void cfq_dispatch_insert(struct request_queue *q, struct request *rq,
+ bool rq_dequeued)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);

cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");

- cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
- cfq_remove_request(rq);
+ if (!rq_dequeued) {
+ cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
+ cfq_remove_request(rq);
+ }
+
cfqq->dispatched++;
(RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);
@@ -2332,6 +2347,10 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
if (!RB_EMPTY_ROOT(&cfqq->sort_list))
goto keep_queue;

+ /* There are requests in the cfqq we depend on. Allow dispatch */
+ if (cfqq->depends_on && !RB_EMPTY_ROOT(&cfqq->depends_on->sort_list))
+ goto keep_queue;
+
/*
* If another queue has a request waiting within our mean seek
* distance, let it run. The expire code will check for close
@@ -2402,7 +2421,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
int dispatched = 0;

while (cfqq->next_rq) {
- cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq);
+ cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq, false);
dispatched++;
}

@@ -2534,6 +2553,77 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
}

/*
+ * This queue was not active and we might expire it because its request got
+ * dispatched in some other queue's context and it is an empty queue now
+ */
+static void
+cfq_expire_inactive_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ /*
+ * If this cfqq is shared between multiple processes, check to
+ * make sure that those processes are still issuing I/Os within
+ * the mean seek distance. If not, it may be time to break the
+ * queues apart again.
+ */
+ if (cfq_cfqq_coop(cfqq) && CFQQ_SEEKY(cfqq))
+ cfq_mark_cfqq_split_coop(cfqq);
+
+ cfq_del_cfqq_rr(cfqd, cfqq);
+}
+
+static void cfq_set_dispatch_dependent_request(struct cfq_data *cfqd,
+ struct cfq_queue *cfqq)
+{
+ struct cfq_queue *dep_cfqq;
+ int rw;
+ struct request *rq;
+
+ dep_cfqq = cfqq->depends_on;
+
+ cfq_log_cfqq(cfqd, cfqq, "dispatch from dependent"
+ " queue pid=%d\n", dep_cfqq->pid);
+ /*
+ * Select a request from the queue we are dependent on. Dequeue
+ * the request from other queue and make rq belong to this
+ * queue and dispatch in this queue's context
+ */
+ rq = cfq_check_fifo(dep_cfqq);
+ if (!rq)
+ rq = dep_cfqq->next_rq;
+ cfq_remove_request(rq);
+ if (RB_EMPTY_ROOT(&dep_cfqq->sort_list))
+ cfq_expire_inactive_queue(cfqd, dep_cfqq);
+ /*
+ * Change the identity of request to belong to current cfqq
+ * cfqg and cic. Drop references to old cfqq, cfqg and cic.
+ */
+ cfqq->ref++;
+ rw = rq_data_dir(rq);
+ cfqq->allocated[rw]++;
+
+ /*
+ * If we are here that means we are idling on the queue and we
+ * must have dispatched at least one request and that must have
+ * set the cfqd->active_cic. Use that
+ */
+ BUG_ON(!cfqd->active_cic);
+ atomic_long_inc(&cfqd->active_cic->ioc->refcount);
+
+ cfq_put_request(rq);
+
+ rq->elevator_private[0] = cfqd->active_cic;
+ rq->elevator_private[1] = cfqq;
+ rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+
+ if (cfq_cfqq_wait_request(cfqq))
+ cfq_del_timer(cfqd, cfqq);
+
+ cfq_clear_cfqq_wait_request(cfqq);
+
+ cfq_dispatch_insert(cfqd->queue, rq, true);
+}
+
+/*
* Dispatch a request from cfqq, moving them to the request queue
* dispatch list.
*/
@@ -2541,22 +2631,26 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
struct request *rq;

- BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
+ BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list) && !cfqq->depends_on);

if (!cfq_may_dispatch(cfqd, cfqq))
return false;

- /*
- * follow expired path, else get first next available
- */
- rq = cfq_check_fifo(cfqq);
- if (!rq)
- rq = cfqq->next_rq;
+ if (RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_set_dispatch_dependent_request(cfqd, cfqq);
+ else {
+ /*
+ * follow expired path, else get first next available
+ */
+ rq = cfq_check_fifo(cfqq);
+ if (!rq)
+ rq = cfqq->next_rq;

- /*
- * insert request into driver dispatch list
- */
- cfq_dispatch_insert(cfqd->queue, rq);
+ /*
+ * insert request into driver dispatch list
+ */
+ cfq_dispatch_insert(cfqd->queue, rq, false);
+ }

if (!cfqd->active_cic) {
struct cfq_io_context *cic = RQ_CIC(rq);
@@ -2640,6 +2734,11 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
}

BUG_ON(cfq_cfqq_on_rr(cfqq));
+
+ /* This cfqq is going away. If there is a dependent queue, drop ref */
+ if (cfqq->depends_on)
+ cfq_put_queue(cfqq->depends_on);
+
kmem_cache_free(cfq_pool, cfqq);
cfq_put_cfqg(cfqg);
}
@@ -3670,6 +3769,109 @@ static int cfq_may_queue(struct request_queue *q, int rw)
}

/*
+ * Calling task depends on "tsk" for further IO. It is caller's responsibility
+ * to make sure tsk pointer is valid during the execution of call
+ */
+static void
+cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ struct cfq_io_context *cic, *tsk_cic;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq, *tsk_cfqq, *__cfqq;
+
+ if (unlikely(!tsk))
+ return;
+
+ if (!tsk->io_context || !current->io_context)
+ return;
+
+ /*
+ * If two processes belong to same cgroup, no need to do this as
+ * both journalling thread and fsync process will go on same
+ * service tree and be able to preempt each other
+ */
+ rcu_read_lock();
+ if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
+ return;
+ rcu_read_unlock();
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ task_lock(tsk);
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq)
+ goto out_unlock;
+
+ /* If cfqq already has a dependent queue, ignore this new queue */
+ if (cfqq->depends_on) {
+ cfq_log_cfqq(cfqd, cfqq, "depends on queue already set"
+ " old_pid=%d", cfqq->depends_on->pid);
+ goto out_unlock;
+ }
+
+ if (!tsk->io_context)
+ goto out_unlock;
+
+ tsk_cic = cfq_cic_lookup(cfqd, tsk->io_context);
+ if (!tsk_cic)
+ goto out_unlock;
+
+ tsk_cfqq = cic_to_cfqq(tsk_cic, 1);
+
+ if (!tsk_cfqq)
+ goto out_unlock;
+
+ /* Don't allow circular dependency among a group of queues */
+ __cfqq = tsk_cfqq;
+
+ while((__cfqq = __cfqq->depends_on)) {
+ if (__cfqq == cfqq)
+ goto out_unlock;
+ }
+
+ /* Take reference on tasks' cfqq */
+ tsk_cfqq->ref++;
+ cfqq->depends_on = tsk_cfqq;
+
+ cfq_log_cfqq(cfqd, cfqq, "set depends on queue pid= %d", tsk->pid);
+
+out_unlock:
+ spin_unlock_irq(q->queue_lock);
+ task_unlock(tsk);
+}
+
+static void cfq_reset_depends_on_task(struct request_queue *q)
+{
+ struct cfq_io_context *cic;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq;
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq || !cfqq->depends_on) {
+ spin_unlock_irq(q->queue_lock);
+ return;
+ }
+
+ cfq_log_cfqq(cfqd, cfqq, "reset depends on queue");
+
+ /* Drop reference on task's cfqq */
+ cfq_put_queue(cfqq->depends_on);
+ cfqq->depends_on = NULL;
+ spin_unlock_irq(q->queue_lock);
+
+}
+
+/*
* queue lock held here
*/
static void cfq_put_request(struct request *rq)
@@ -4206,6 +4408,8 @@ static struct elevator_type iosched_cfq = {
.elevator_set_req_fn = cfq_set_request,
.elevator_put_req_fn = cfq_put_request,
.elevator_may_queue_fn = cfq_may_queue,
+ .elevator_set_depends_on_task_fn = cfq_set_depends_on_task,
+ .elevator_reset_depends_on_task_fn = cfq_reset_depends_on_task,
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
diff --git a/block/elevator.c b/block/elevator.c
index a3b64bc..6780914 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -783,6 +783,22 @@ int elv_may_queue(struct request_queue *q, int rw)
return ELV_MQUEUE_MAY;
}

+void elv_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_set_depends_on_task_fn)
+ e->ops->elevator_set_depends_on_task_fn(q, tsk);
+}
+
+void elv_reset_depends_on_task(struct request_queue *q)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_reset_depends_on_task_fn)
+ e->ops->elevator_reset_depends_on_task_fn(q);
+}
+
void elv_abort_queue(struct request_queue *q)
{
struct request *rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4ce6e68..f8d27f9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -209,6 +209,8 @@ typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
typedef void (softirq_done_fn)(struct request *);
typedef int (dma_drain_needed_fn)(struct request *);
typedef int (lld_busy_fn) (struct request_queue *q);
+typedef void (set_depends_on_task_fn) (struct request_queue *q, struct task_struct *tsk);
+typedef void (reset_depends_on_task_fn) (struct request_queue *q);

enum blk_eh_timer_return {
BLK_EH_NOT_HANDLED,
@@ -283,7 +285,8 @@ struct request_queue
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
-
+ set_depends_on_task_fn *set_depends_on_task_fn;
+ reset_depends_on_task_fn *reset_depends_on_task_fn;
/*
* Dispatch queue sorting
*/
@@ -895,6 +898,9 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
}

+extern void blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk);
+extern void blk_reset_depends_on_task(struct request_queue *q);
+
/*
* tag stuff
*/
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d800d51..0bc82f8 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,6 +23,8 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);
+typedef void (elevator_set_depends_on_task_fn) (struct request_queue *, struct task_struct *);
+typedef void (elevator_reset_depends_on_task_fn) (struct request_queue *);

typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
typedef void (elevator_put_req_fn) (struct request *);
@@ -54,6 +56,8 @@ struct elevator_ops
elevator_put_req_fn *elevator_put_req_fn;

elevator_may_queue_fn *elevator_may_queue_fn;
+ elevator_set_depends_on_task_fn *elevator_set_depends_on_task_fn;
+ elevator_reset_depends_on_task_fn *elevator_reset_depends_on_task_fn;

elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
@@ -114,6 +118,8 @@ extern struct request *elv_latter_request(struct request_queue *, struct request
extern int elv_register_queue(struct request_queue *q);
extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
+extern void elv_set_depends_on_task(struct request_queue *q, struct task_struct *tsk);
+extern void elv_reset_depends_on_task(struct request_queue *q);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
--
1.7.4.4

2011-06-27 20:19:05

by Vivek Goyal

Subject: [PATCH 2/3] ext4: Explicitly specify fsync dependency on journaling thread

Set/reset fsync's dependency on the journalling thread. This allows
CFQ to dispatch IO from the journalling thread in fsync's time slice.
Otherwise, lots of cfq queue idling takes place if fsync is running
in a separate cgroup and the journalling thread is running in the root group.

Signed-off-by: Vivek Goyal <[email protected]>
---
fs/ext4/fsync.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index ce66d2f..8b16a90 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -216,7 +216,10 @@ int ext4_sync_file(struct file *file, int datasync)
!jbd2_trans_will_send_data_barrier(journal, commit_tid))
needs_barrier = true;
jbd2_log_start_commit(journal, commit_tid);
+ blk_set_depends_on_task(journal->j_dev->bd_disk->queue,
+ journal->j_task);
ret = jbd2_log_wait_commit(journal, commit_tid);
+ blk_reset_depends_on_task(journal->j_dev->bd_disk->queue);
if (needs_barrier)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
out:
--
1.7.4.4

2011-06-27 20:19:23

by Vivek Goyal

Subject: [PATCH 3/3] ext3: Explicitly specify fsync dependency on journaling thread

Set/reset fsync's dependency on the journalling thread. This allows
CFQ to dispatch IO from the journalling thread in fsync's time slice.
Otherwise, lots of cfq queue idling takes place if fsync is running
in a separate cgroup and the journalling thread is running in the root group.

Signed-off-by: Vivek Goyal <[email protected]>
---
fs/ext3/fsync.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 09b13bb..1e8de9c 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -82,7 +82,10 @@ int ext3_sync_file(struct file *file, int datasync)
!journal_trans_will_send_data_barrier(journal, commit_tid))
needs_barrier = 1;
log_start_commit(journal, commit_tid);
+ blk_set_depends_on_task(journal->j_dev->bd_disk->queue,
+ journal->j_task);
ret = log_wait_commit(journal, commit_tid);
+ blk_reset_depends_on_task(journal->j_dev->bd_disk->queue);

/*
* In case we didn't commit a transaction, we have to flush
--
1.7.4.4

2011-06-28 01:19:13

by Shaohua Li

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, 2011-06-28 at 04:17 +0800, Vivek Goyal wrote:
> Hi,
>
> Konstantin reported that fsync is very slow with ext4 if fsyncing process
> is in a separate cgroup and one is using CFQ IO scheduler.
>
> https://lkml.org/lkml/2011/6/23/269
>
> Issue seems to be that fsync process is in a separate cgroup and journalling
> thread is in root cgroup. After every IO from fsync, CFQ idles on fysnc
> process queue waiting for more requests to come. But this process is now
> waiting for IO to finish from journaling thread. After waiting for 8ms
> fsync's queue gives way to jbd's queue. Then we start idling on jbd
> thread and new IO from fsync is sitting in a separate queue in a separate
> group.
>
> Bottom line, that after every IO we end up idling on fysnc and jbd thread
> so much that if somebody is doing fsync after every 4K of IO, throughput
> nose dives.
>
> Similar issue had issue come up with-in same cgroup also when "fsync"
> and "jbd" thread were being queued on differnt service trees and idling
> was killing. At that point of time two solutions were proposed. One
> from Jeff Moyer and one from Corrado Zoccolo.
>
> Jeff came up with the idea of coming with block layer API to yield the
> queue if explicitly told by file system, hence cutting down on idling.
>
> https://lkml.org/lkml/2010/7/2/277
>
> Corrado, came up with a simpler approach of keeping jbd and fsync processes
> on same service tree by parsing RQ_NOIDLE flag. By queuing on same service
> tree, one queue preempts other queue hence cutting down on idling time.
> Upstream went ahead with simpler approach to fix the issue.
>
> commit 749ef9f8423054e326f3a246327ed2db4b6d395f
> Author: Corrado Zoccolo <[email protected]>
> Date: Mon Sep 20 15:24:50 2010 +0200
>
> cfq: improve fsync performance for small files
>
>
> Now with cgroups, same problem resurfaces but this time we can not queue
> both the processes on same service tree and take advantage of preemption
> as separate cgroups have separate service trees and both processes
> belong to separate cgroups. We do not allow cross cgroup preemption
> as that wil break down the isolation between groups.
>
> So this patch series resurrects Jeff's solution of file system specifying
> the IO dependencies between threads explicitly to the block layer/ioscheduler.
> One ioscheduler knows that current queue we are idling on is dependent on
> IO from some other queue, CFQ allows dispatch of requests from that other
> queue in the context of current active queue.
>
> So if fysnc thread specifies the dependency on journalling thread, then
> when time slice of fsync thread is running, it allows dispatch from
> jbd in the time slice of fsync thread. Hence cutting down on idling.
>
> This patch series seems to be working for me. I did testing for ext4 only.
> This series is based on for-3.1/core branch of Jen's block tree.
> Konstantin, can you please give it a try and see if it fixes your
> issue.
>
> Any feedback on how to solve this issue is appreciated.
Hi Vivek,
can we introduce a group think time check in CFQ? Say the last queue
backed for the group is a non-idle queue; if the group think time is big,
we don't allow group idle and preemption can happen. The fsync thread is a
non-idle queue with Corrado's patch, so this allows a fast group switch.

Thanks,
Shaohua

2011-06-28 01:41:23

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 09:18:52AM +0800, Shaohua Li wrote:
> On Tue, 2011-06-28 at 04:17 +0800, Vivek Goyal wrote:
> > Hi,
> >
> > Konstantin reported that fsync is very slow with ext4 if fsyncing process
> > is in a separate cgroup and one is using CFQ IO scheduler.
> >
> > https://lkml.org/lkml/2011/6/23/269
> >
> > Issue seems to be that fsync process is in a separate cgroup and journalling
> > thread is in root cgroup. After every IO from fsync, CFQ idles on fysnc
> > process queue waiting for more requests to come. But this process is now
> > waiting for IO to finish from journaling thread. After waiting for 8ms
> > fsync's queue gives way to jbd's queue. Then we start idling on jbd
> > thread and new IO from fsync is sitting in a separate queue in a separate
> > group.
> >
> > Bottom line, that after every IO we end up idling on fysnc and jbd thread
> > so much that if somebody is doing fsync after every 4K of IO, throughput
> > nose dives.
> >
> > Similar issue had issue come up with-in same cgroup also when "fsync"
> > and "jbd" thread were being queued on differnt service trees and idling
> > was killing. At that point of time two solutions were proposed. One
> > from Jeff Moyer and one from Corrado Zoccolo.
> >
> > Jeff came up with the idea of coming with block layer API to yield the
> > queue if explicitly told by file system, hence cutting down on idling.
> >
> > https://lkml.org/lkml/2010/7/2/277
> >
> > Corrado, came up with a simpler approach of keeping jbd and fsync processes
> > on same service tree by parsing RQ_NOIDLE flag. By queuing on same service
> > tree, one queue preempts other queue hence cutting down on idling time.
> > Upstream went ahead with simpler approach to fix the issue.
> >
> > commit 749ef9f8423054e326f3a246327ed2db4b6d395f
> > Author: Corrado Zoccolo <[email protected]>
> > Date: Mon Sep 20 15:24:50 2010 +0200
> >
> > cfq: improve fsync performance for small files
> >
> >
> > Now with cgroups, same problem resurfaces but this time we can not queue
> > both the processes on same service tree and take advantage of preemption
> > as separate cgroups have separate service trees and both processes
> > belong to separate cgroups. We do not allow cross cgroup preemption
> > as that wil break down the isolation between groups.
> >
> > So this patch series resurrects Jeff's solution of file system specifying
> > the IO dependencies between threads explicitly to the block layer/ioscheduler.
> > One ioscheduler knows that current queue we are idling on is dependent on
> > IO from some other queue, CFQ allows dispatch of requests from that other
> > queue in the context of current active queue.
> >
> > So if fysnc thread specifies the dependency on journalling thread, then
> > when time slice of fsync thread is running, it allows dispatch from
> > jbd in the time slice of fsync thread. Hence cutting down on idling.
> >
> > This patch series seems to be working for me. I did testing for ext4 only.
> > This series is based on for-3.1/core branch of Jen's block tree.
> > Konstantin, can you please give it a try and see if it fixes your
> > issue.
> >
> > Any feedback on how to solve this issue is appreciated.
> Hi Vivek,
> can we introduce a group think time check in cfq? say in a group the
> last queue is backed for the group and the queue is a non-idle queue, if
> the group think time is big, we don't allow the group idle and preempt
> could happen. The fsync thread is a non-idle queue with Corrado's patch,
> this allows fast group switch.

In this case regular queue idling is kicking in, not group idling. So some
kind of think time stats might be useful for the group idle check,
but not necessarily for queue idling.

Secondly, for this case the think time will change. If you stop idling on
the fsync and jbd threads, both will be dispatching IOs fast and both will
have a small think time. We will see that the think time is small, so we
will enable idling. Then their think time will increase as both
get blocked behind each other, and then we will remove idling again. So
it looks like we will be oscillating between enabling and disabling
idling based on think time.

If we don't allow idling on sync-no-idle queues, then basic CFQ will
be broken.

Actually in your statement there are multiple suggestions and I might
have actually missed the gist of it.

Thanks
Vivek

2011-06-28 02:05:31

by Shaohua Li

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, 2011-06-28 at 09:40 +0800, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 09:18:52AM +0800, Shaohua Li wrote:
> > On Tue, 2011-06-28 at 04:17 +0800, Vivek Goyal wrote:
> > > Hi,
> > >
> > > Konstantin reported that fsync is very slow with ext4 if fsyncing process
> > > is in a separate cgroup and one is using CFQ IO scheduler.
> > >
> > > https://lkml.org/lkml/2011/6/23/269
> > >
> > > Issue seems to be that fsync process is in a separate cgroup and journalling
> > > thread is in root cgroup. After every IO from fsync, CFQ idles on fysnc
> > > process queue waiting for more requests to come. But this process is now
> > > waiting for IO to finish from journaling thread. After waiting for 8ms
> > > fsync's queue gives way to jbd's queue. Then we start idling on jbd
> > > thread and new IO from fsync is sitting in a separate queue in a separate
> > > group.
> > >
> > > Bottom line, that after every IO we end up idling on fysnc and jbd thread
> > > so much that if somebody is doing fsync after every 4K of IO, throughput
> > > nose dives.
> > >
> > > Similar issue had issue come up with-in same cgroup also when "fsync"
> > > and "jbd" thread were being queued on differnt service trees and idling
> > > was killing. At that point of time two solutions were proposed. One
> > > from Jeff Moyer and one from Corrado Zoccolo.
> > >
> > > Jeff came up with the idea of coming with block layer API to yield the
> > > queue if explicitly told by file system, hence cutting down on idling.
> > >
> > > https://lkml.org/lkml/2010/7/2/277
> > >
> > > Corrado, came up with a simpler approach of keeping jbd and fsync processes
> > > on same service tree by parsing RQ_NOIDLE flag. By queuing on same service
> > > tree, one queue preempts other queue hence cutting down on idling time.
> > > Upstream went ahead with simpler approach to fix the issue.
> > >
> > > commit 749ef9f8423054e326f3a246327ed2db4b6d395f
> > > Author: Corrado Zoccolo <[email protected]>
> > > Date: Mon Sep 20 15:24:50 2010 +0200
> > >
> > > cfq: improve fsync performance for small files
> > >
> > >
> > > Now with cgroups, same problem resurfaces but this time we can not queue
> > > both the processes on same service tree and take advantage of preemption
> > > as separate cgroups have separate service trees and both processes
> > > belong to separate cgroups. We do not allow cross cgroup preemption
> > > as that wil break down the isolation between groups.
> > >
> > > So this patch series resurrects Jeff's solution of file system specifying
> > > the IO dependencies between threads explicitly to the block layer/ioscheduler.
> > > One ioscheduler knows that current queue we are idling on is dependent on
> > > IO from some other queue, CFQ allows dispatch of requests from that other
> > > queue in the context of current active queue.
> > >
> > > So if fysnc thread specifies the dependency on journalling thread, then
> > > when time slice of fsync thread is running, it allows dispatch from
> > > jbd in the time slice of fsync thread. Hence cutting down on idling.
> > >
> > > This patch series seems to be working for me. I did testing for ext4 only.
> > > This series is based on for-3.1/core branch of Jen's block tree.
> > > Konstantin, can you please give it a try and see if it fixes your
> > > issue.
> > >
> > > Any feedback on how to solve this issue is appreciated.
> > Hi Vivek,
> > can we introduce a group think time check in cfq? say in a group the
> > last queue is backed for the group and the queue is a non-idle queue, if
> > the group think time is big, we don't allow the group idle and preempt
> > could happen. The fsync thread is a non-idle queue with Corrado's patch,
> > this allows fast group switch.
>
> In this case regular queue idle is hitting and not group idle. So some
> kind of think time stats probably might be useful for group idle check
> but not necessarily for queue idle.
I thought your problem is a group idle issue. fsync uses WRITE_SYNC, which
will make the queue sync-non-idle because REQ_NOIDLE is set. This is
exactly what Corrado's patch is for. A fsync queue itself doesn't idle unless
it's the last queue in a group. Am I missing anything?

> Secondly, for this case think time will change. If you stop idling on
> fsync and jbd threads, both will be dispatching IOs fast and both will
> have small thinktime. We will think that thinktime is small so we
> will enable idle. Then there think time will increase as both will
> get blocked behind each other. And then we will removing idling. So
> looks like we will be oscillating between enabling and disabling
> think time.
That is possible; the think time check (even for queues) always has such an
issue. Not sure how severe the issue is. I assume jbd will dispatch
several requests and this will make the fsync thread's think time big.

> If we don't allow idling on sync-no-idle queues, then basic CFQ will
> be broken.
Hmm, CFQ only allows idling on sync queues; a sync-no-idle queue isn't
allowed to idle.

Thanks,
Shaohua

2011-06-28 02:49:18

by Dave Chinner

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Mon, Jun 27, 2011 at 04:17:41PM -0400, Vivek Goyal wrote:
> Hi,
>
> Konstantin reported that fsync is very slow with ext4 if fsyncing process
> is in a separate cgroup and one is using CFQ IO scheduler.
>
> https://lkml.org/lkml/2011/6/23/269
>
> Issue seems to be that fsync process is in a separate cgroup and journalling
> thread is in root cgroup. After every IO from fsync, CFQ idles on fysnc
> process queue waiting for more requests to come. But this process is now
> waiting for IO to finish from journaling thread. After waiting for 8ms
> fsync's queue gives way to jbd's queue. Then we start idling on jbd
> thread and new IO from fsync is sitting in a separate queue in a separate
> group.
>
> Bottom line, that after every IO we end up idling on fysnc and jbd thread
> so much that if somebody is doing fsync after every 4K of IO, throughput
> nose dives.
>
> Similar issue had issue come up with-in same cgroup also when "fsync"
> and "jbd" thread were being queued on differnt service trees and idling
> was killing. At that point of time two solutions were proposed. One
> from Jeff Moyer and one from Corrado Zoccolo.
>
> Jeff came up with the idea of coming with block layer API to yield the
> queue if explicitly told by file system, hence cutting down on idling.
>
> https://lkml.org/lkml/2010/7/2/277
>
> Corrado, came up with a simpler approach of keeping jbd and fsync processes
> on same service tree by parsing RQ_NOIDLE flag. By queuing on same service
> tree, one queue preempts other queue hence cutting down on idling time.
> Upstream went ahead with simpler approach to fix the issue.
>
> commit 749ef9f8423054e326f3a246327ed2db4b6d395f
> Author: Corrado Zoccolo <[email protected]>
> Date: Mon Sep 20 15:24:50 2010 +0200
>
> cfq: improve fsync performance for small files
>
>
> Now with cgroups, same problem resurfaces but this time we can not queue
> both the processes on same service tree and take advantage of preemption
> as separate cgroups have separate service trees and both processes
> belong to separate cgroups. We do not allow cross cgroup preemption
> as that wil break down the isolation between groups.
>
> So this patch series resurrects Jeff's solution of file system specifying
> the IO dependencies between threads explicitly to the block layer/ioscheduler.
> One ioscheduler knows that current queue we are idling on is dependent on
> IO from some other queue, CFQ allows dispatch of requests from that other
> queue in the context of current active queue.

Vivek, I'm not sure this is a general solution. If we hand journal
IO off to a workqueue, then we've got no idea what the "dependent
task" is.

I bring this up as I have a current patchset that moves all the XFS
journal IO out of process context into a workqueue to solve
process-visible operation latency (e.g. 1000 mkdir syscalls run at
1ms each, the 1001st triggers a journal checkpoint and takes 500ms)
and background checkpoint submission races. This effectively means
that XFS will trigger the same bad CFQ behaviour on fsync, but have
no means of avoiding it because we don't have a specific task to
yield to.

And FWIW, we're going to be using workqueues more and more in XFS
for asynchronous processing of operations. I'm looking to use WQs
for speculative readahead of inodes, all our delayed metadata
writeback, log IO submission, free space allocation requests,
background inode allocation, background inode freeing, background
EOF truncation, etc to process as much work asynchronously outside
syscall context as possible (let's use all those CPU cores we
have!).

All of these things will push potentially dependent IO operations
outside of the bounds of the process actually doing the operation,
so some general solution to the "dependent IO in an undefined thread
context" problem really needs to be solved sooner rather than
later...

As it is, I don't have any good ideas of how to solve this, but I
thought it is worth bringing to your attention while you are trying
to solve a similar issue.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-28 11:02:14

by Konstantin Khlebnikov

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

Vivek Goyal wrote:
> This patch series seems to be working for me. I did testing for ext4 only.
> This series is based on for-3.1/core branch of Jen's block tree.
> Konstantin, can you please give it a try and see if it fixes your
> issue.

It works for me too, for ext3 and ext4, on top of 3.0-rc5, after these trivial fixes:

--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1511,7 +1519,7 @@ static void cfq_add_rq_rb(struct request *rq)
* if that happens, put the alias on the dispatch list
*/
while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
- cfq_dispatch_insert(cfqd->queue, __alias);
+ cfq_dispatch_insert(cfqd->queue, __alias, false);

if (!cfq_cfqq_on_rr(cfqq))
cfq_add_cfqq_rr(cfqd, cfqq);
@@ -3797,12 +3797,11 @@ cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
*/
rcu_read_lock();
if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
- return;
- rcu_read_unlock();
+ goto out_unlock_rcu;

cic = cfq_cic_lookup(cfqd, current->io_context);
if (!cic)
- return;
+ goto out_unlock_rcu;

task_lock(tsk);
spin_lock_irq(q->queue_lock);
@@ -3847,6 +3846,8 @@ cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
out_unlock:
spin_unlock_irq(q->queue_lock);
task_unlock(tsk);
+out_unlock_rcu:
+ rcu_read_unlock();
}

2011-06-28 13:06:17

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 10:03:54AM +0800, Shaohua Li wrote:

[..]
> > > > Any feedback on how to solve this issue is appreciated.
> > > Hi Vivek,
> > > can we introduce a group think time check in cfq? say in a group the
> > > last queue is backed for the group and the queue is a non-idle queue, if
> > > the group think time is big, we don't allow the group idle and preempt
> > > could happen. The fsync thread is a non-idle queue with Corrado's patch,
> > > this allows fast group switch.
> >
> > In this case regular queue idle is hitting and not group idle. So some
> > kind of think time stats probably might be useful for group idle check
> > but not necessarily for queue idle.
> I thought your problem is group idle issue. fsync uses WRITE_SYNC, which
> will make the queue be sync-non-idle because REQ_NOIDLE is set. This is
> exactly what Corrado's patch for. a fsync queue itself isn't idle unless
> it's the last queue in a group. Am I missing anything?

We idle on the last queue on the sync-noidle tree. So we idle on the fsync
queue as it is the last queue on the sync-noidle tree. That's how we provide
protection to all sync-noidle queues against sync-idle queues. Instead of
idling on individual queues we do the idling as a group, on the service tree.

>
> > Secondly, for this case think time will change. If you stop idling on
> > fsync and jbd threads, both will be dispatching IOs fast and both will
> > have small thinktime. We will think that thinktime is small so we
> > will enable idle. Then there think time will increase as both will
> > get blocked behind each other. And then we will removing idling. So
> > looks like we will be oscillating between enabling and disabling
> > think time.
> That is possible, the think time check (even for queues) always has such
> issue. Not sure how severe the issue is. Assume jbd will dispatch
> several requests and this will make fsync thread think time big.
>
> > If we don't allow idling on sync-no-idle queues, then basic CFQ will
> > be broken.
> Hmm, CFQ only allows idling on sync queues, sync-no-idle queue isn't
> allowed idling.

See above.

Thanks
Vivek

2011-06-28 13:37:21

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 12:47:38PM +1000, Dave Chinner wrote:
>
> Vivek, I'm not sure this is a general solution. If we hand journal
> IO off to a workqueue, then we've got no idea what the "dependent
> task" is.
>
> I bring this up as I have a current patchset that moves all the XFS
> journal IO out of process context into a workqueue to solve
> process-visible operation latency (e.g. 1000 mkdir syscalls run at
> 1ms each, the 1001st triggers a journal checkpoint and takes 500ms)
> and background checkpoint submission races. This effectively means
> that XFS will trigger the same bad CFQ behaviour on fsync, but have
> no means of avoiding it because we don't have a specific task to
> yield to.
>
> And FWIW, we're going to be using workqueues more and more in XFS
> for asynchronous processing of operations. I'm looking to use WQs
> for speculative readahead of inodes, all our delayed metadata
> writeback, log IO submission, free space allocation requests,
> background inode allocation, background inode freeing, background
> EOF truncation, etc to process as much work asynchronously outside
> syscall context as possible (let's use all those CPU cores we
> have!).
>
> All of these things will push potentially dependent IO operations
> outside of the bounds of the process actually doing the operation,
> so some general solution to the "dependent IO in an undefined thread
> context" problem really needs to be solved sooner rather than
> later...
>
> As it is, I don't have any good ideas of how to solve this, but I
> thought it is worth bringing to your attention while you are trying
> to solve a similar issue.

Dave,

Couple of thoughts.

- We can introduce another block layer call where dependencies are set up
from worker thread context. So when the process schedules the work, it can
save the task information somewhere, and when the worker thread actually
calls the specified function, that function can set up the dependency
between the worker thread and the submitting task (a rough sketch of this
option follows below).

Probably the original process can tear down the dependency connection
when the IO is done. I am assuming that the IO submitting process is waiting
for all the IO to finish.

In the current framework one can specify multiple processes being dependent
on one thread but not vice versa. I think we should be able to
handle that by maintaining a linked list of dependent queues instead
of a single pointer. So if a process submits a bunch of jobs with the help
of a bunch of worker threads from multiple cpus, I think that case is
manageable with some extension to the current patches.

- Or we can try something more exotic: when we schedule a work item, one
should be able to tell which cgroup the worker should run in.
When the worker actually runs, it can migrate itself to the destination
cgroup and submit IO. This does not take care of cases like the
journalling thread, where multiple processes are dependent on a single
kernel thread. In that case the dependent queue solution above should work
well.

So I think the above API can be extended to also handle the case of work
queues, or we could look into migrating the worker into a user-specified
cgroup if that turns out to be a better solution.
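
Purely as a strawman of the first option above (none of these helpers exist
in the posted patches), the work item could carry the submitting task's
pointer and the worker could register the reverse dependency when it runs;
fs_io_work, fs_io_work_fn(), submit_journal_io() and
blk_set_task_depends_on_current() are all hypothetical names:

/* Hypothetical sketch only, illustrating the suggestion above */
struct fs_io_work {
	struct work_struct	work;
	struct request_queue	*q;
	struct task_struct	*submitter;	/* saved when the work is queued */
};

static void fs_io_work_fn(struct work_struct *work)
{
	struct fs_io_work *w = container_of(work, struct fs_io_work, work);

	/*
	 * Reverse of blk_set_depends_on_task(): tell the elevator that the
	 * submitter's queue depends on IO issued by this worker thread.
	 */
	blk_set_task_depends_on_current(w->q, w->submitter);

	submit_journal_io(w);	/* the actual IO work */

	/*
	 * As noted above, the submitting task would tear the dependency down
	 * itself (blk_reset_depends_on_task()) once its wait completes.
	 */
}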

Thanks
Vivek

2011-06-28 13:45:52

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 03:00:39PM +0400, Konstantin Khlebnikov wrote:
> Vivek Goyal wrote:
> >This patch series seems to be working for me. I did testing for ext4 only.
> >This series is based on for-3.1/core branch of Jen's block tree.
> >Konstantin, can you please give it a try and see if it fixes your
> >issue.
>
> It works for me too, for ext3 and ext4, on top 3.0-rc5, after these trivial fixes:
>
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1511,7 +1519,7 @@ static void cfq_add_rq_rb(struct request *rq)
> * if that happens, put the alias on the dispatch list
> */
> while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
> - cfq_dispatch_insert(cfqd->queue, __alias);
> + cfq_dispatch_insert(cfqd->queue, __alias, false);
>
> if (!cfq_cfqq_on_rr(cfqq))
> cfq_add_cfqq_rr(cfqd, cfqq);
> @@ -3797,12 +3797,11 @@ cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
> */
> rcu_read_lock();
> if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
> - return;
> - rcu_read_unlock();
> + goto out_unlock_rcu;
>
> cic = cfq_cic_lookup(cfqd, current->io_context);
> if (!cic)
> - return;
> + goto out_unlock_rcu;

You have done this change because you want to keep cfq_cic_lookup() also
inside the rcu read-side critical section? I am assuming that it works even
without this, though keeping it under rcu is probably more correct as
cic objects are freed in an rcu manner.

Thanks
Vivek

2011-06-28 14:44:00

by Konstantin Khlebnikov

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 03:00:39PM +0400, Konstantin Khlebnikov wrote:
>> Vivek Goyal wrote:
>>> This patch series seems to be working for me. I did testing for ext4 only.
>>> This series is based on for-3.1/core branch of Jen's block tree.
>>> Konstantin, can you please give it a try and see if it fixes your
>>> issue.
>>
>> It works for me too, for ext3 and ext4, on top 3.0-rc5, after these trivial fixes:
>>
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1511,7 +1519,7 @@ static void cfq_add_rq_rb(struct request *rq)
>> * if that happens, put the alias on the dispatch list
>> */
>> while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
>> - cfq_dispatch_insert(cfqd->queue, __alias);
>> + cfq_dispatch_insert(cfqd->queue, __alias, false);
>>
>> if (!cfq_cfqq_on_rr(cfqq))
>> cfq_add_cfqq_rr(cfqd, cfqq);
>> @@ -3797,12 +3797,11 @@ cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
>> */
>> rcu_read_lock();
>> if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
>> - return;
>> - rcu_read_unlock();
>> + goto out_unlock_rcu;
>>
>> cic = cfq_cic_lookup(cfqd, current->io_context);
>> if (!cic)
>> - return;
>> + goto out_unlock_rcu;
>
> You have done this change because you want to keep cfq_cic_lookup() also
> in rcu read side critical section? I am assuming that it works even
> without this. Though keeping it under rcu is probably more correct as
> cic objects are freed in rcu manner.

No, you just forgot to release the rcu lock in the case task_blkio_cgroup(current) == task_blkio_cgroup(tsk).

>
> Thanks
> Vivek

2011-06-28 14:50:17

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 06:42:57PM +0400, Konstantin Khlebnikov wrote:
> Vivek Goyal wrote:
> >On Tue, Jun 28, 2011 at 03:00:39PM +0400, Konstantin Khlebnikov wrote:
> >>Vivek Goyal wrote:
> >>>This patch series seems to be working for me. I did testing for ext4 only.
> >>>This series is based on for-3.1/core branch of Jen's block tree.
> >>>Konstantin, can you please give it a try and see if it fixes your
> >>>issue.
> >>
> >>It works for me too, for ext3 and ext4, on top 3.0-rc5, after these trivial fixes:
> >>
> >>--- a/block/cfq-iosched.c
> >>+++ b/block/cfq-iosched.c
> >>@@ -1511,7 +1519,7 @@ static void cfq_add_rq_rb(struct request *rq)
> >> * if that happens, put the alias on the dispatch list
> >> */
> >> while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
> >>- cfq_dispatch_insert(cfqd->queue, __alias);
> >>+ cfq_dispatch_insert(cfqd->queue, __alias, false);
> >>
> >> if (!cfq_cfqq_on_rr(cfqq))
> >> cfq_add_cfqq_rr(cfqd, cfqq);
> >>@@ -3797,12 +3797,11 @@ cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
> >> */
> >> rcu_read_lock();
> >> if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
> >>- return;
> >>- rcu_read_unlock();
> >>+ goto out_unlock_rcu;
> >>
> >> cic = cfq_cic_lookup(cfqd, current->io_context);
> >> if (!cic)
> >>- return;
> >>+ goto out_unlock_rcu;
> >
> >You have done this change because you want to keep cfq_cic_lookup() also
> >in rcu read side critical section? I am assuming that it works even
> >without this. Though keeping it under rcu is probably more correct as
> >cic objects are freed in rcu manner.
>
> No, your just forgot to relese rcu in case task_blkio_cgroup(current) == task_blkio_cgroup(tsk)

Oh, that's right. Thanks for catching this. I will fix it.

Thanks
Vivek

2011-06-28 21:20:41

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, Jun 28, 2011 at 10:47:44AM -0400, Vivek Goyal wrote:
[..]
> > >> rcu_read_lock();
> > >> if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk))
> > >>- return;
> > >>- rcu_read_unlock();
> > >>+ goto out_unlock_rcu;
> > >>
> > >> cic = cfq_cic_lookup(cfqd, current->io_context);
> > >> if (!cic)
> > >>- return;
> > >>+ goto out_unlock_rcu;
> > >
> > >You have done this change because you want to keep cfq_cic_lookup() also
> > >in rcu read side critical section? I am assuming that it works even
> > >without this. Though keeping it under rcu is probably more correct as
> > >cic objects are freed in rcu manner.
> >
> > No, your just forgot to relese rcu in case task_blkio_cgroup(current) == task_blkio_cgroup(tsk)
>
> Oh, that's right. Thanks for catching this. I will fix it.

Here is the fixed version of the patch. I am now calling rcu_read_unlock()
before returning if the cgroups are the same.

block: A new interface for specifying IO dependencies among tasks

This patch introduces two new block layer interfaces which allow file
systems to specify IO dependencies among processes. For example, the fsync
process is dependent on IO completion from the journalling thread in ext3
and ext4.

blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
blk_reset_depends_on_task(struct request_queue *q)

The first one allows one to say that the current process is dependent on
IO completions from "tsk". This call will let CFQ know about the
dependency, and CFQ will allow "tsk"'s IO to be dispatched in the
allocated time slice of the calling process.

Once the dependency is over, one needs to tear down the mapping
by calling reset function.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-core.c | 42 ++++++++
block/cfq-iosched.c | 238 +++++++++++++++++++++++++++++++++++++++++++----
block/elevator.c | 16 +++
include/linux/blkdev.h | 8 +
include/linux/elevator.h | 6 +
5 files changed, 293 insertions(+), 17 deletions(-)

Index: linux-2.6-block/block/cfq-iosched.c
===================================================================
--- linux-2.6-block.orig/block/cfq-iosched.c 2011-06-28 16:32:45.672242961 -0400
+++ linux-2.6-block/block/cfq-iosched.c 2011-06-28 16:35:35.571928038 -0400
@@ -75,6 +75,8 @@ static DEFINE_IDA(cic_index_ida);
#define sample_valid(samples) ((samples) > 80)
#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)

+static void cfq_put_request(struct request *rq);
+
/*
* Most of our rbtree usage is for sorting with min extraction, so
* if we cache the leftmost node we don't have to walk down the tree
@@ -148,6 +150,12 @@ struct cfq_queue {
struct cfq_group *cfqg;
/* Number of sectors dispatched from queue in single dispatch round */
unsigned long nr_sectors;
+
+ /*
+ * This cfqq's further IO is dependent on IO queued in the other
+ * cfqq pointed to by depends_on.
+ */
+ struct cfq_queue *depends_on;
};

/*
@@ -447,7 +455,7 @@ static inline int cfqg_busy_async_queues
+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
}

-static void cfq_dispatch_insert(struct request_queue *, struct request *);
+static void cfq_dispatch_insert(struct request_queue *, struct request *, bool);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
@@ -2043,16 +2051,23 @@ static void cfq_arm_slice_timer(struct c

/*
* Move request from internal lists to the request queue dispatch list.
+ * @rq_dequeued: rq to be dispatched has already been removed from associated
+ * cfqq. This is useful when rq from dependent queue is being
+ * dispatched in current queue context.
*/
-static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
+static void cfq_dispatch_insert(struct request_queue *q, struct request *rq,
+ bool rq_dequeued)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);

cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");

- cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
- cfq_remove_request(rq);
+ if (!rq_dequeued) {
+ cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
+ cfq_remove_request(rq);
+ }
+
cfqq->dispatched++;
(RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);
@@ -2332,6 +2347,10 @@ static struct cfq_queue *cfq_select_queu
if (!RB_EMPTY_ROOT(&cfqq->sort_list))
goto keep_queue;

+ /* There are requests in the cfqq we depend on. Allow dispatch */
+ if (cfqq->depends_on && !RB_EMPTY_ROOT(&cfqq->depends_on->sort_list))
+ goto keep_queue;
+
/*
* If another queue has a request waiting within our mean seek
* distance, let it run. The expire code will check for close
@@ -2402,7 +2421,7 @@ static int __cfq_forced_dispatch_cfqq(st
int dispatched = 0;

while (cfqq->next_rq) {
- cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq);
+ cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq, false);
dispatched++;
}

@@ -2534,6 +2553,77 @@ static bool cfq_may_dispatch(struct cfq_
}

/*
+ * This queue was not active and we might expire it because its request got
+ * dispatched in some other queue's context and it is an empty queue now
+ */
+static void
+cfq_expire_inactive_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ /*
+ * If this cfqq is shared between multiple processes, check to
+ * make sure that those processes are still issuing I/Os within
+ * the mean seek distance. If not, it may be time to break the
+ * queues apart again.
+ */
+ if (cfq_cfqq_coop(cfqq) && CFQQ_SEEKY(cfqq))
+ cfq_mark_cfqq_split_coop(cfqq);
+
+ cfq_del_cfqq_rr(cfqd, cfqq);
+}
+
+static void cfq_set_dispatch_dependent_request(struct cfq_data *cfqd,
+ struct cfq_queue *cfqq)
+{
+ struct cfq_queue *dep_cfqq;
+ int rw;
+ struct request *rq;
+
+ dep_cfqq = cfqq->depends_on;
+
+ cfq_log_cfqq(cfqd, cfqq, "dispatch from dependent"
+ " queue pid=%d\n", dep_cfqq->pid);
+ /*
+ * Select a request from the queue we are dependent on. Dequeue
+ * the request from other queue and make rq belong to this
+ * queue and dispatch in this queue's context
+ */
+ rq = cfq_check_fifo(dep_cfqq);
+ if (!rq)
+ rq = dep_cfqq->next_rq;
+ cfq_remove_request(rq);
+ if (RB_EMPTY_ROOT(&dep_cfqq->sort_list))
+ cfq_expire_inactive_queue(cfqd, dep_cfqq);
+ /*
+ * Change the identity of request to belong to current cfqq
+ * cfqg and cic. Drop references to old cfqq, cfqg and cic.
+ */
+ cfqq->ref++;
+ rw = rq_data_dir(rq);
+ cfqq->allocated[rw]++;
+
+ /*
+ * If we are here that means we are idling on the queue and we
+ * must have dispatched at least one request and that must have
+ * set the cfqd->active_cic. Use that
+ */
+ BUG_ON(!cfqd->active_cic);
+ atomic_long_inc(&cfqd->active_cic->ioc->refcount);
+
+ cfq_put_request(rq);
+
+ rq->elevator_private[0] = cfqd->active_cic;
+ rq->elevator_private[1] = cfqq;
+ rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+
+ if (cfq_cfqq_wait_request(cfqq))
+ cfq_del_timer(cfqd, cfqq);
+
+ cfq_clear_cfqq_wait_request(cfqq);
+
+ cfq_dispatch_insert(cfqd->queue, rq, true);
+}
+
+/*
* Dispatch a request from cfqq, moving them to the request queue
* dispatch list.
*/
@@ -2541,22 +2631,26 @@ static bool cfq_dispatch_request(struct
{
struct request *rq;

- BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
+ BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list) && !cfqq->depends_on);

if (!cfq_may_dispatch(cfqd, cfqq))
return false;

- /*
- * follow expired path, else get first next available
- */
- rq = cfq_check_fifo(cfqq);
- if (!rq)
- rq = cfqq->next_rq;
+ if (RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_set_dispatch_dependent_request(cfqd, cfqq);
+ else {
+ /*
+ * follow expired path, else get first next available
+ */
+ rq = cfq_check_fifo(cfqq);
+ if (!rq)
+ rq = cfqq->next_rq;

- /*
- * insert request into driver dispatch list
- */
- cfq_dispatch_insert(cfqd->queue, rq);
+ /*
+ * insert request into driver dispatch list
+ */
+ cfq_dispatch_insert(cfqd->queue, rq, false);
+ }

if (!cfqd->active_cic) {
struct cfq_io_context *cic = RQ_CIC(rq);
@@ -2640,6 +2734,11 @@ static void cfq_put_queue(struct cfq_que
}

BUG_ON(cfq_cfqq_on_rr(cfqq));
+
+ /* This cfqq is going away. If there is a dependent queue, drop ref */
+ if (cfqq->depends_on)
+ cfq_put_queue(cfqq->depends_on);
+
kmem_cache_free(cfq_pool, cfqq);
cfq_put_cfqg(cfqg);
}
@@ -3670,6 +3769,111 @@ static int cfq_may_queue(struct request_
}

/*
+ * Calling task depends on "tsk" for further IO. It is the caller's responsibility
+ * to make sure the tsk pointer stays valid for the duration of this call
+ */
+static void
+cfq_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ struct cfq_io_context *cic, *tsk_cic;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq, *tsk_cfqq, *__cfqq;
+
+ if (unlikely(!tsk))
+ return;
+
+ if (!tsk->io_context || !current->io_context)
+ return;
+
+ /*
+ * If two processes belong to same cgroup, no need to do this as
+ * both journalling thread and fsync process will go on same
+ * service tree and be able to preempt each other
+ */
+ rcu_read_lock();
+ if (task_blkio_cgroup(current) == task_blkio_cgroup(tsk)) {
+ rcu_read_unlock();
+ return;
+ }
+ rcu_read_unlock();
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ task_lock(tsk);
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq)
+ goto out_unlock;
+
+ /* If cfqq already has a dependent queue, ignore this new queue */
+ if (cfqq->depends_on) {
+ cfq_log_cfqq(cfqd, cfqq, "depends on queue already set"
+ " old_pid=%d", cfqq->depends_on->pid);
+ goto out_unlock;
+ }
+
+ if (!tsk->io_context)
+ goto out_unlock;
+
+ tsk_cic = cfq_cic_lookup(cfqd, tsk->io_context);
+ if (!tsk_cic)
+ goto out_unlock;
+
+ tsk_cfqq = cic_to_cfqq(tsk_cic, 1);
+
+ if (!tsk_cfqq)
+ goto out_unlock;
+
+ /* Don't allow circular dependency among a group of queues */
+ __cfqq = tsk_cfqq;
+
+ while((__cfqq = __cfqq->depends_on)) {
+ if (__cfqq == cfqq)
+ goto out_unlock;
+ }
+
+ /* Take reference on task's cfqq */
+ tsk_cfqq->ref++;
+ cfqq->depends_on = tsk_cfqq;
+
+ cfq_log_cfqq(cfqd, cfqq, "set depends on queue pid= %d", tsk->pid);
+
+out_unlock:
+ spin_unlock_irq(q->queue_lock);
+ task_unlock(tsk);
+}
+
+static void cfq_reset_depends_on_task(struct request_queue *q)
+{
+ struct cfq_io_context *cic;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq;
+
+ cic = cfq_cic_lookup(cfqd, current->io_context);
+ if (!cic)
+ return;
+
+ spin_lock_irq(q->queue_lock);
+
+ cfqq = cic_to_cfqq(cic, 1);
+ if (!cfqq || !cfqq->depends_on) {
+ spin_unlock_irq(q->queue_lock);
+ return;
+ }
+
+ cfq_log_cfqq(cfqd, cfqq, "reset depends on queue");
+
+ /* Drop reference on task's cfqq */
+ cfq_put_queue(cfqq->depends_on);
+ cfqq->depends_on = NULL;
+ spin_unlock_irq(q->queue_lock);
+
+}
+
+/*
* queue lock held here
*/
static void cfq_put_request(struct request *rq)
@@ -4206,6 +4410,8 @@ static struct elevator_type iosched_cfq
.elevator_set_req_fn = cfq_set_request,
.elevator_put_req_fn = cfq_put_request,
.elevator_may_queue_fn = cfq_may_queue,
+ .elevator_set_depends_on_task_fn = cfq_set_depends_on_task,
+ .elevator_reset_depends_on_task_fn = cfq_reset_depends_on_task,
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
Index: linux-2.6-block/block/blk-core.c
===================================================================
--- linux-2.6-block.orig/block/blk-core.c 2011-06-28 10:02:23.544839889 -0400
+++ linux-2.6-block/block/blk-core.c 2011-06-28 16:33:13.703491049 -0400
@@ -398,6 +398,19 @@ static int blk_init_free_list(struct req
return 0;
}

+static void
+generic_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->elevator)
+ elv_set_depends_on_task(q, tsk);
+}
+
+static void generic_reset_depends_on_task(struct request_queue *q)
+{
+ if (q->elevator)
+ elv_reset_depends_on_task(q);
+}
+
struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
{
return blk_alloc_queue_node(gfp_mask, -1);
@@ -534,6 +547,8 @@ blk_init_allocated_queue_node(struct req
q->prep_rq_fn = NULL;
q->unprep_rq_fn = NULL;
q->queue_flags = QUEUE_FLAG_DEFAULT;
+ q->set_depends_on_task_fn = generic_set_depends_on_task;
+ q->reset_depends_on_task_fn = generic_reset_depends_on_task;

/* Override internal queue lock with supplied lock pointer */
if (lock)
@@ -2766,6 +2781,33 @@ void blk_finish_plug(struct blk_plug *pl
}
EXPORT_SYMBOL(blk_finish_plug);

+/*
+ * Give a hint to CFQ that current task is dependent on IO coming from
+ * "tsk". In such cases, CFQ will allow dispatch from "tsk" queue in
+ * the time slice of "current" process and this can cut down on
+ * unnecessary queue idling.
+ *
+ * Note: the CFQ implementation takes task_lock(tsk) and the queue lock (with
+ * interrupts disabled) internally, and releases both before returning.
+ */
+void blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ if (q->set_depends_on_task_fn)
+ q->set_depends_on_task_fn(q, tsk);
+}
+EXPORT_SYMBOL(blk_set_depends_on_task);
+
+/*
+ * Tear down any dependent task mapping previously set up by the current
+ * task.
+ */
+void blk_reset_depends_on_task(struct request_queue *q)
+{
+ if (q->reset_depends_on_task_fn)
+ q->reset_depends_on_task_fn(q);
+}
+EXPORT_SYMBOL(blk_reset_depends_on_task);
+
int __init blk_dev_init(void)
{
BUILD_BUG_ON(__REQ_NR_BITS > 8 *
Index: linux-2.6-block/block/elevator.c
===================================================================
--- linux-2.6-block.orig/block/elevator.c 2011-06-28 16:32:45.673243005 -0400
+++ linux-2.6-block/block/elevator.c 2011-06-28 16:33:13.705491137 -0400
@@ -783,6 +783,22 @@ int elv_may_queue(struct request_queue *
return ELV_MQUEUE_MAY;
}

+void elv_set_depends_on_task(struct request_queue *q, struct task_struct *tsk)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_set_depends_on_task_fn)
+ e->ops->elevator_set_depends_on_task_fn(q, tsk);
+}
+
+void elv_reset_depends_on_task(struct request_queue *q)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_reset_depends_on_task_fn)
+ e->ops->elevator_reset_depends_on_task_fn(q);
+}
+
void elv_abort_queue(struct request_queue *q)
{
struct request *rq;
Index: linux-2.6-block/include/linux/blkdev.h
===================================================================
--- linux-2.6-block.orig/include/linux/blkdev.h 2011-06-28 16:32:46.020258283 -0400
+++ linux-2.6-block/include/linux/blkdev.h 2011-06-28 16:33:13.706491181 -0400
@@ -209,6 +209,8 @@ typedef int (merge_bvec_fn) (struct requ
typedef void (softirq_done_fn)(struct request *);
typedef int (dma_drain_needed_fn)(struct request *);
typedef int (lld_busy_fn) (struct request_queue *q);
+typedef void (set_depends_on_task_fn) (struct request_queue *q, struct task_struct *tsk);
+typedef void (reset_depends_on_task_fn) (struct request_queue *q);

enum blk_eh_timer_return {
BLK_EH_NOT_HANDLED,
@@ -283,7 +285,8 @@ struct request_queue
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
-
+ set_depends_on_task_fn *set_depends_on_task_fn;
+ reset_depends_on_task_fn *reset_depends_on_task_fn;
/*
* Dispatch queue sorting
*/
@@ -895,6 +898,9 @@ static inline bool blk_needs_flush_plug(
return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
}

+extern void blk_set_depends_on_task(struct request_queue *q, struct task_struct *tsk);
+extern void blk_reset_depends_on_task(struct request_queue *q);
+
/*
* tag stuff
*/
Index: linux-2.6-block/include/linux/elevator.h
===================================================================
--- linux-2.6-block.orig/include/linux/elevator.h 2011-06-28 16:32:46.022258371 -0400
+++ linux-2.6-block/include/linux/elevator.h 2011-06-28 16:33:13.707491225 -0400
@@ -23,6 +23,8 @@ typedef void (elevator_add_req_fn) (stru
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);
+typedef void (elevator_set_depends_on_task_fn) (struct request_queue *, struct task_struct *);
+typedef void (elevator_reset_depends_on_task_fn) (struct request_queue *);

typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
typedef void (elevator_put_req_fn) (struct request *);
@@ -54,6 +56,8 @@ struct elevator_ops
elevator_put_req_fn *elevator_put_req_fn;

elevator_may_queue_fn *elevator_may_queue_fn;
+ elevator_set_depends_on_task_fn *elevator_set_depends_on_task_fn;
+ elevator_reset_depends_on_task_fn *elevator_reset_depends_on_task_fn;

elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
@@ -114,6 +118,8 @@ extern struct request *elv_latter_reques
extern int elv_register_queue(struct request_queue *q);
extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
+extern void elv_set_depends_on_task(struct request_queue *q, struct task_struct *tsk);
+extern void elv_reset_depends_on_task(struct request_queue *q);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
extern int elv_set_request(struct request_queue *, struct request *, gfp_t);

2011-06-29 01:05:04

by Shaohua Li

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Tue, 2011-06-28 at 21:04 +0800, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 10:03:54AM +0800, Shaohua Li wrote:
>
> [..]
> > > > > Any feedback on how to solve this issue is appreciated.
> > > > Hi Vivek,
> > > > can we introduce a group think time check in cfq? say in a group the
> > > > last queue is backed for the group and the queue is a non-idle queue, if
> > > > the group think time is big, we don't allow the group idle and preempt
> > > > could happen. The fsync thread is a non-idle queue with Corrado's patch,
> > > > this allows fast group switch.
> > >
> > > In this case regular queue idle is hitting and not group idle. So some
> > > kind of think time stats probably might be useful for group idle check
> > > but not necessarily for queue idle.
> > I thought your problem is group idle issue. fsync uses WRITE_SYNC, which
> > will make the queue be sync-non-idle because REQ_NOIDLE is set. This is
> > exactly what Corrado's patch for. a fsync queue itself isn't idle unless
> > it's the last queue in a group. Am I missing anything?
>
> We idle on last queue on sync-noidle tree. So we idle on fsync queue as
> it is last queue on sync-noidle tree. That's how we provide protection
> to all sync-noidle queues against sync-idle queues. Instead of idling
> on individual queues we do idling in group and that is on service tree.
OK, but this looks silly. We are idling on a noidle service tree or a
group (backed by the last queue of the tree or group) because we assume
the tree or group can dispatch a request soon. But if the think time of
the tree or group is big, that assumption isn't true, and idling here is
blind. I think we can extend the think time check to both the service
tree and the group.
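
Roughly something like the sketch below (illustrative only, assuming the
service tree kept think time stats analogous to the per-cic
ttime_mean/ttime_samples; those fields don't exist on the tree today):

static bool cfq_st_should_idle(struct cfq_data *cfqd, struct cfq_rb_root *st)
{
	/*
	 * Hypothetical per-service-tree think time stats, updated the
	 * same way the per-cic ones are.
	 */
	if (!sample_valid(st->ttime_samples))
		return true;	/* not enough history, keep idling */

	/* recent gaps between IOs exceed the idle window: don't idle */
	if (st->ttime_mean > cfqd->cfq_slice_idle)
		return false;

	return true;
}

cfq_arm_slice_timer() could then skip arming the idle timer when this
returns false, and the same kind of check could drive the group idle
decision.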

Thanks,
Shaohua

2011-06-29 01:30:04

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Wed, Jun 29, 2011 at 09:04:55AM +0800, Shaohua Li wrote:

[..]
> > We idle on last queue on sync-noidle tree. So we idle on fsync queue as
> > it is last queue on sync-noidle tree. That's how we provide protection
> > to all sync-noidle queues against sync-idle queues. Instead of idling
> > on individual queues we do idling in group and that is on service tree.
> OK, but this looks silly. We are idling on a noidle service tree or a
> group (backed by the last queue of the tree or group) because we assume
> the tree or group can dispatch a request soon. But if the think time of
> the tree or group is big, that assumption isn't true, and idling here is
> blind. I think we can extend the think time check to both the service
> tree and the group.

We can implement the thinktime for noidle service tree and group idle as
well. That's not a problem, though I am yet to be convinced that thinktime
still makes sense for the group. I guess it will just mean that in the
past the group has done a bunch of IO with gaps between IOs of less than 8ms. If
yes, then we expect it to do more IO in the future. Frankly speaking, I am
not too sure how well the past IO pattern predicts the future IO pattern
of the group.

But anyway, the point is, even if we implement it, it will not solve
the fsync issue at hand, for the reason I explained in the previous mail. We
will be oscillating between high think time and low thinktime depending
on whether we are idling or not. There is no correlation between think
time of fsync thread and idling here.

I think you are banking on the fact that after fsync, journaling thread
IO can take more than 8ms, hence delaying the next IO from the fsync thread and
pushing its thinktime beyond 8ms, so we will not idle on the fsync thread at
all. But that is just one corner case, and I think it is broken in multiple
cases.

- If filesystem barriers are disabled or the backend storage has battery
backup, then journal IO will most likely go into the cache and barriers
will be ignored. In that case the write will finish almost instantly
and we will get the next IO from the fsync thread very soon, pushing
down the thinktime of the fsync thread, which will enable idling and we will
be back to the problem we are trying to solve.

- The fsync thread might be submitting a string of IOs (say 10-12) before it
moves to the journal thread to commit metadata. In that case we might
have lowered the thinktime of fsync enough to enable idling.

So implementing think time for the service tree/group might be a good idea
in general, but it will not solve this IO dependency issue across cgroups.

Thanks
Vivek

2011-06-30 00:29:46

by Shaohua Li

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

On Wed, Jun 29, 2011 at 09:29:55AM +0800, Vivek Goyal wrote:
> On Wed, Jun 29, 2011 at 09:04:55AM +0800, Shaohua Li wrote:
>
> [..]
> > > We idle on last queue on sync-noidle tree. So we idle on fsync queue as
> > > it is last queue on sync-noidle tree. That's how we provide protection
> > > to all sync-noidle queues against sync-idle queues. Instead of idling
> > > on individual queues we do idling in group and that is on service tree.
> > OK, but this looks silly. We are idling on a noidle service tree or a
> > group (backed by the last queue of the tree or group) because we assume
> > the tree or group can dispatch a request soon. But if the think time of
> > the tree or group is big, that assumption isn't true, and idling here is
> > blind. I think we can extend the think time check to both the service
> > tree and the group.
>
> We can implement the thinktime for noidle service tree and group idle as
> well. That's not a problem, though I am yet to be convinced that thinktime
> still makes sense for the group. I guess it will just mean that in the
> past the group has done a bunch of IO with gaps between IOs of less than 8ms. If
> yes, then we expect it to do more IO in the future. Frankly speaking, I am
> not too sure how well the past IO pattern predicts the future IO pattern
> of the group.
>
> But anyway, the point is, even if we implement it, it will not solve
> the fsync issue at hand, for the reason I explained in the previous mail. We
> will be oscillating between high think time and low thinktime depending
> on whether we are idling or not. There is no correlation between think
> time of fsync thread and idling here.
>
> I think you are banking on the fact that after fsync, journaling thread
> IO can take more than 8ms, hence delaying the next IO from the fsync thread and
> pushing its thinktime beyond 8ms, so we will not idle on the fsync thread at
> all. But that is just one corner case, and I think it is broken in multiple
> cases.
>
> - If filesystem barriers are disabled or the backend storage has battery
> backup, then journal IO will most likely go into the cache and barriers
> will be ignored. In that case the write will finish almost instantly
> and we will get the next IO from the fsync thread very soon, pushing
> down the thinktime of the fsync thread, which will enable idling and we will
> be back to the problem we are trying to solve.
>
> - The fsync thread might be submitting a string of IOs (say 10-12) before it
> moves to the journal thread to commit metadata. In that case we might
> have lowered the thinktime of fsync enough to enable idling.
>
> So implementing think time for the service tree/group might be a good idea
> in general, but it will not solve this IO dependency issue across cgroups.
OK, fair enough. I'll give it a try and check how things change with the fsync workload.

Thanks,
Shaohua