2009-10-23 21:14:55

by Jeff Moyer

Subject: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

Hi,

This is a follow-up patch series to the original close cooperator
support for CFQ. The problem is that some programs (NFSd, dump(8),
the iSCSI target mode driver, qemu) interleave sequential I/Os between
multiple threads or processes. The result is that CFQ's idling logic
introduces large delays, leading to very low throughput. The original
patch partially addresses these problems by detecting close cooperators
and allowing them to jump ahead in the scheduling order. Unfortunately,
this doesn't work 100% of the time, and some processes in the group can
get way ahead (LBA-wise) of the others, leading to a lot of seeks.
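
For reference, here is a minimal sketch of the kind of access pattern
that triggers the problem (hypothetical code, not the actual read-test2
source): several processes split one sequential read between them, so
the union of the streams is sequential, but no single process submits
back-to-back chunks, and CFQ idles on each queue in turn.

/*
 * NPROC children read argv[1] in CHUNK-sized pieces; child i reads
 * chunks i, i+NPROC, i+2*NPROC, ...  Collectively the file is read
 * front to back, but each process on its own looks strided.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	4
#define CHUNK	(64 * 1024)

int main(int argc, char **argv)
{
	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			char *buf = malloc(CHUNK);
			int fd = open(argv[1], O_RDONLY);
			off_t off = (off_t)i * CHUNK;

			while (pread(fd, buf, CHUNK, off) > 0)
				off += (off_t)NPROC * CHUNK; /* skip the others' chunks */
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}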

This patch series addresses the problems in the current implementation by
merging the cfq_queues of close cooperators. The results are encouraging:

read-test2 emulates the I/O patterns of dump(8). The following results
are taken from 50 runs of patched, 16 runs of unpatched (I got impatient):

             Average (MB/s)   Std. Dev.
----------------------------------------
Patched CFQ:       88.81773      0.9485
Vanilla CFQ:       12.62678      0.24535

Single streaming reader over NFS; results, in MB/s, are the average of 2
runs.

nfsd's | patched cfq | vanilla cfq | deadline
-------+-------------+-------------+---------
     1 |          45 |          45 |       36
     2 |          57 |          60 |       60
     4 |          38 |          49 |       50
     8 |          34 |          40 |       49
    16 |          34 |          43 |       53

I've tested that sequential access patterns do trigger the merging of queues,
and that when the I/O becomes random, the cfq_queues are split apart. Comments,
as always, are greatly appreciated.

Cheers,
Jeff


2009-10-23 21:14:59

by Jeff Moyer

Subject: [PATCH 1/4] cfq: calculate the seek_mean per cfq_queue not per cfq_io_context

Async cfq_queues are already shared between processes within the same
priority, and forthcoming patches will change the mapping of cic to sync
cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process
based on the cfq_queue instead of the cfq_io_context.
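
For reference, the seek statistics being moved are a fixed-point,
decaying average. A standalone userspace sketch of the same arithmetic
(clamping of outlier distances omitted): seek_samples converges toward
256, so seek_mean tracks a weighted mean of recent seek distances, with
history decayed by 7/8 on every sample.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t seek_samples = 0, seek_total = 0, seek_mean;
	uint64_t sdist[] = { 8, 8, 4096, 8, 8, 8 };	/* distances in sectors */

	for (int i = 0; i < 6; i++) {
		/* same update as cfq_update_io_seektime */
		seek_samples = (7 * seek_samples + 256) / 8;
		seek_total = (7 * seek_total + 256 * sdist[i]) / 8;
		seek_mean = (seek_total + seek_samples / 2) / seek_samples;
		printf("sample %d: seek_mean=%llu\n", i,
		       (unsigned long long)seek_mean);
	}
	return 0;
}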

Signed-off-by: Jeff Moyer <[email protected]>
---
block/cfq-iosched.c | 68 ++++++++++++++++++++++-----------------------
include/linux/iocontext.h | 5 ---
2 files changed, 33 insertions(+), 40 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 069a610..78cc8ee 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -112,6 +112,11 @@ struct cfq_queue {
unsigned short ioprio, org_ioprio;
unsigned short ioprio_class, org_ioprio_class;

+ unsigned int seek_samples;
+ u64 seek_total;
+ sector_t seek_mean;
+ sector_t last_request_pos;
+
pid_t pid;
};

@@ -962,16 +967,16 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
return cfqd->last_position - blk_rq_pos(rq);
}

-#define CIC_SEEK_THR 8 * 1024
-#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEK_THR 8 * 1024
+#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR)

-static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
+static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ struct request *rq)
{
- struct cfq_io_context *cic = cfqd->active_cic;
- sector_t sdist = cic->seek_mean;
+ sector_t sdist = cfqq->seek_mean;

- if (!sample_valid(cic->seek_samples))
- sdist = CIC_SEEK_THR;
+ if (!sample_valid(cfqq->seek_samples))
+ sdist = CFQQ_SEEK_THR;

return cfq_dist_from_last(cfqd, rq) <= sdist;
}
@@ -1000,7 +1005,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* will contain the closest sector.
*/
__cfqq = rb_entry(parent, struct cfq_queue, p_node);
- if (cfq_rq_close(cfqd, __cfqq->next_rq))
+ if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq))
return __cfqq;

if (blk_rq_pos(__cfqq->next_rq) < sector)
@@ -1011,7 +1016,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
return NULL;

__cfqq = rb_entry(node, struct cfq_queue, p_node);
- if (cfq_rq_close(cfqd, __cfqq->next_rq))
+ if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq))
return __cfqq;

return NULL;
@@ -1034,13 +1039,6 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
struct cfq_queue *cfqq;

/*
- * A valid cfq_io_context is necessary to compare requests against
- * the seek_mean of the current cfqq.
- */
- if (!cfqd->active_cic)
- return NULL;
-
- /*
* We should notice if some of the queues are cooperating, eg
* working closely on the same area of the disk. In that case,
* we can group them together and don't waste time idling.
@@ -1110,7 +1108,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* seeks. so allow a little bit of time for him to submit a new rq
*/
sl = cfqd->cfq_slice_idle;
- if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+ if (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
@@ -1947,33 +1945,33 @@ cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
}

static void
-cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
+cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct request *rq)
{
sector_t sdist;
u64 total;

- if (!cic->last_request_pos)
+ if (!cfqq->last_request_pos)
sdist = 0;
- else if (cic->last_request_pos < blk_rq_pos(rq))
- sdist = blk_rq_pos(rq) - cic->last_request_pos;
+ else if (cfqq->last_request_pos < blk_rq_pos(rq))
+ sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
else
- sdist = cic->last_request_pos - blk_rq_pos(rq);
+ sdist = cfqq->last_request_pos - blk_rq_pos(rq);

/*
* Don't allow the seek distance to get too large from the
* odd fragment, pagein, etc
*/
- if (cic->seek_samples <= 60) /* second&third seek */
- sdist = min(sdist, (cic->seek_mean * 4) + 2*1024*1024);
+ if (cfqq->seek_samples <= 60) /* second&third seek */
+ sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024);
else
- sdist = min(sdist, (cic->seek_mean * 4) + 2*1024*64);
+ sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64);

- cic->seek_samples = (7*cic->seek_samples + 256) / 8;
- cic->seek_total = (7*cic->seek_total + (u64)256*sdist) / 8;
- total = cic->seek_total + (cic->seek_samples/2);
- do_div(total, cic->seek_samples);
- cic->seek_mean = (sector_t)total;
+ cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8;
+ cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8;
+ total = cfqq->seek_total + (cfqq->seek_samples/2);
+ do_div(total, cfqq->seek_samples);
+ cfqq->seek_mean = (sector_t)total;
}

/*
@@ -1995,11 +1993,11 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_latency && cfqd->hw_tag && CIC_SEEKY(cic)))
+ (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
unsigned int slice_idle = cfqd->cfq_slice_idle;
- if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+ if (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))
slice_idle = msecs_to_jiffies(CFQ_MIN_TT);
if (cic->ttime_mean > slice_idle)
enable_idle = 0;
@@ -2066,7 +2064,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
* if this request is as-good as one we would expect from the
* current cfqq, let it preempt
*/
- if (cfq_rq_close(cfqd, rq))
+ if (cfq_rq_close(cfqd, cfqq, rq))
return true;

return false;
@@ -2108,10 +2106,10 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->meta_pending++;

cfq_update_io_thinktime(cfqd, cic);
- cfq_update_io_seektime(cfqd, cic, rq);
+ cfq_update_io_seektime(cfqd, cfqq, rq);
cfq_update_idle_window(cfqd, cfqq, cic);

- cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+ cfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);

if (cfqq == cfqd->active_queue) {
/*
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 4da4a75..eb73632 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -40,16 +40,11 @@ struct cfq_io_context {
struct io_context *ioc;

unsigned long last_end_request;
- sector_t last_request_pos;

unsigned long ttime_total;
unsigned long ttime_samples;
unsigned long ttime_mean;

- unsigned int seek_samples;
- u64 seek_total;
- sector_t seek_mean;
-
struct list_head queue_list;
struct hlist_node cic_list;

--
1.6.2.5

2009-10-23 21:15:36

by Jeff Moyer

Subject: [PATCH 2/4] cfq: merge cooperating cfq_queues

Currently, when cooperating cfq_queues are detected, they are allowed
to skip ahead in the scheduling order. It is much more efficient to
automatically share the cfq_queue data structure between cooperating
processes. Performance of the read-test2 benchmark (which is written
to emulate the dump(8) utility) went from 12MB/s to 90MB/s on my SATA
disk. NFS servers with multiple nfsd threads also saw performance
increases.
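
To summarize the mechanism, here is a userspace model with stand-in
types (hypothetical names; the real chain walk and reference transfer
are in cfq_setup_merge in the diff below): a queue scheduled for
merging points at its target via new_cfqq, and each owning process
takes a reference on the target so it cannot go away before the merge
happens in process context.

struct cfqq_model {
	int ref;			/* total references */
	int io_refs;			/* refs held by in-flight requests */
	struct cfqq_model *new_cfqq;	/* queue we will merge into */
};

/* references held by processes = total refs minus I/O refs */
static int process_refs(struct cfqq_model *q)
{
	return q->ref - q->io_refs;
}

static void setup_merge(struct cfqq_model *cfqq, struct cfqq_model *new_cfqq)
{
	struct cfqq_model *q;

	/* follow the merge chain to its end, bailing out on a cycle */
	while ((q = new_cfqq->new_cfqq)) {
		if (q == cfqq)
			return;
		new_cfqq = q;
	}

	/* no owning processes left, so nothing to merge */
	if (process_refs(cfqq) == 0)
		return;

	/* each owner will eventually be switched to new_cfqq */
	cfqq->new_cfqq = new_cfqq;
	new_cfqq->ref += process_refs(cfqq);
}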

Signed-off-by: Jeff Moyer <[email protected]>
---
block/cfq-iosched.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 78cc8ee..f0994ae 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -118,6 +118,8 @@ struct cfq_queue {
sector_t last_request_pos;

pid_t pid;
+
+ struct cfq_queue *new_cfqq;
};

/*
@@ -1047,6 +1049,12 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!cfqq)
return NULL;

+ /*
+ * It only makes sense to merge sync queues.
+ */
+ if (!cfq_cfqq_sync(cfqq))
+ return NULL;
+
if (cfq_cfqq_coop(cfqq))
return NULL;

@@ -1168,6 +1176,43 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
}

/*
+ * Must be called with the queue_lock held.
+ */
+static int cfqq_process_refs(struct cfq_queue *cfqq)
+{
+ int process_refs, io_refs;
+
+ io_refs = cfqq->allocated[READ] + cfqq->allocated[WRITE];
+ process_refs = atomic_read(&cfqq->ref) - io_refs;
+ BUG_ON(process_refs < 0);
+ return process_refs;
+}
+
+static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
+{
+ int process_refs;
+ struct cfq_queue *__cfqq;
+
+ /* Avoid a circular list and skip interim queue merges */
+ while ((__cfqq = new_cfqq->new_cfqq)) {
+ if (__cfqq == cfqq)
+ return;
+ new_cfqq = __cfqq;
+ }
+
+ process_refs = cfqq_process_refs(cfqq);
+ /*
+ * If the process for the cfqq has gone away, there is no
+ * sense in merging the queues.
+ */
+ if (process_refs == 0)
+ return;
+
+ cfqq->new_cfqq = new_cfqq;
+ atomic_add(process_refs, &new_cfqq->ref);
+}
+
+/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
*/
@@ -1196,11 +1241,14 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
* If another queue has a request waiting within our mean seek
* distance, let it run. The expire code will check for close
* cooperators and put the close queue at the front of the service
- * tree.
+ * tree. If possible, merge the expiring queue with the new cfqq.
*/
new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
- if (new_cfqq)
+ if (new_cfqq) {
+ if (!cfqq->new_cfqq)
+ cfq_setup_merge(cfqq, new_cfqq);
goto expire;
+ }

/*
* No requests pending. If the active queue still has requests in
@@ -1511,11 +1559,29 @@ static void cfq_free_io_context(struct io_context *ioc)

static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ struct cfq_queue *__cfqq, *next;
+
if (unlikely(cfqq == cfqd->active_queue)) {
__cfq_slice_expired(cfqd, cfqq, 0);
cfq_schedule_dispatch(cfqd);
}

+ /*
+ * If this queue was scheduled to merge with another queue, be
+ * sure to drop the reference taken on that queue (and others in
+ * the merge chain). See cfq_setup_merge and cfq_merge_cfqqs.
+ */
+ __cfqq = cfqq->new_cfqq;
+ while (__cfqq) {
+ if (__cfqq == cfqq) {
+ WARN(1, "cfqq->new_cfqq loop detected\n");
+ break;
+ }
+ next = __cfqq->new_cfqq;
+ cfq_put_queue(__cfqq);
+ __cfqq = next;
+ }
+
cfq_put_queue(cfqq);
}

@@ -2323,6 +2389,16 @@ static void cfq_put_request(struct request *rq)
}
}

+static struct cfq_queue *
+cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
+ struct cfq_queue *cfqq)
+{
+ cfq_log_cfqq(cfqd, cfqq, "merging with queue %p", cfqq->new_cfqq);
+ cic_set_cfqq(cic, cfqq->new_cfqq, 1);
+ cfq_put_queue(cfqq);
+ return cic_to_cfqq(cic, 1);
+}
+
/*
* Allocate cfq data structures associated with this request.
*/
@@ -2349,6 +2425,15 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
+ } else {
+ /*
+ * Check to see if this queue is scheduled to merge with
+ * another, closely cooperating queue. The merging of
+ * queues happens here as it must be done in process context.
+ * The reference on new_cfqq was taken in merge_cfqqs.
+ */
+ if (cfqq->new_cfqq)
+ cfqq = cfq_merge_cfqqs(cfqd, cic, cfqq);
}

cfqq->allocated[rw]++;
--
1.6.2.5

2009-10-23 21:15:52

by Jeff Moyer

Subject: [PATCH 3/4] cfq: change the meaning of the cfqq_coop flag

The coop flag used to indicate that a cfqq was allowed to jump ahead in
the scheduling order because it had submitted a request close to the
queue that just executed. Since closely cooperating queues are now
merged, the flag holds little meaning. Change it to indicate that
multiple queues were merged. This will later be used to allow merged
queues to be broken up when they are no longer cooperating.

Signed-off-by: Jeff Moyer <[email protected]>
---
block/cfq-iosched.c | 20 ++++++--------------
1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f0994ae..5e01a0a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -202,7 +202,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
CFQ_CFQQ_FLAG_sync, /* synchronous queue */
- CFQ_CFQQ_FLAG_coop, /* has done a coop jump of the queue */
+ CFQ_CFQQ_FLAG_coop, /* cfqq is shared */
};

#define CFQ_CFQQ_FNS(name) \
@@ -950,11 +950,8 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
struct cfq_queue *cfqq)
{
- if (!cfqq) {
+ if (!cfqq)
cfqq = cfq_get_next_queue(cfqd);
- if (cfqq)
- cfq_clear_cfqq_coop(cfqq);
- }

__cfq_set_active_queue(cfqd, cfqq);
return cfqq;
@@ -1035,8 +1032,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* assumption.
*/
static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
- struct cfq_queue *cur_cfqq,
- bool probe)
+ struct cfq_queue *cur_cfqq)
{
struct cfq_queue *cfqq;

@@ -1055,11 +1051,6 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!cfq_cfqq_sync(cfqq))
return NULL;

- if (cfq_cfqq_coop(cfqq))
- return NULL;
-
- if (!probe)
- cfq_mark_cfqq_coop(cfqq);
return cfqq;
}

@@ -1243,7 +1234,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
* cooperators and put the close queue at the front of the service
* tree. If possible, merge the expiring queue with the new cfqq.
*/
- new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
+ new_cfqq = cfq_close_cooperator(cfqd, cfqq);
if (new_cfqq) {
if (!cfqq->new_cfqq)
cfq_setup_merge(cfqq, new_cfqq);
@@ -2294,7 +2285,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
cfq_slice_expired(cfqd, 1);
- else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
+ else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq) &&
sync && !rq_noidle(rq))
cfq_arm_slice_timer(cfqd);
}
@@ -2395,6 +2386,7 @@ cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
{
cfq_log_cfqq(cfqd, cfqq, "merging with queue %p", cfqq->new_cfqq);
cic_set_cfqq(cic, cfqq->new_cfqq, 1);
+ cfq_mark_cfqq_coop(cfqq->new_cfqq);
cfq_put_queue(cfqq);
return cic_to_cfqq(cic, 1);
}
--
1.6.2.5

2009-10-23 21:14:58

by Jeff Moyer

Subject: [PATCH 4/4] cfq: break apart merged cfqqs if they stop cooperating

cfq_queues are merged if they are issuing requests within the mean seek
distance of one another. This patch detects when the cooperation stops
and breaks the queues back up.
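
The heuristic, modeled standalone (a userspace sketch with hypothetical
names; the constants and tests mirror the diff below, with jiffies
replaced by a plain tick count): once a shared queue has looked seeky
for a full CFQQ_COOP_TOUT window, the next cfq_set_request breaks it up.

#include <stdbool.h>

#define CFQQ_COOP_TOUT	100		/* stand-in for HZ jiffies */

struct cfqq_model {
	bool coop;			/* queue is shared (merged) */
	unsigned long seeky_start;	/* tick when it first turned seeky */
};

/* called from the seek statistics update path */
static void note_seekiness(struct cfqq_model *q, bool seeky, unsigned long now)
{
	if (!q->coop)
		return;
	if (seeky && !q->seeky_start)
		q->seeky_start = now;	/* start the clock */
	else if (!seeky)
		q->seeky_start = 0;	/* sequential again, reset */
}

/* called when setting up a request: split after a full seeky window */
static bool should_split(struct cfqq_model *q, unsigned long now)
{
	return q->seeky_start && now > q->seeky_start + CFQQ_COOP_TOUT;
}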

Signed-off-by: Jeff Moyer <[email protected]>
---
block/cfq-iosched.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5e01a0a..47d6aac 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -38,6 +38,12 @@ static int cfq_slice_idle = HZ / 125;
*/
#define CFQ_MIN_TT (2)

+/*
+ * Allow merged cfqqs to perform this amount of seeky I/O before
+ * deciding to break the queues up again.
+ */
+#define CFQQ_COOP_TOUT (HZ)
+
#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)

@@ -116,6 +122,7 @@ struct cfq_queue {
u64 seek_total;
sector_t seek_mean;
sector_t last_request_pos;
+ unsigned long seeky_start;

pid_t pid;

@@ -1036,6 +1043,11 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
{
struct cfq_queue *cfqq;

+ if (!cfq_cfqq_sync(cur_cfqq))
+ return NULL;
+ if (CFQQ_SEEKY(cur_cfqq))
+ return NULL;
+
/*
* We should notice if some of the queues are cooperating, eg
* working closely on the same area of the disk. In that case,
@@ -1050,6 +1062,8 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
*/
if (!cfq_cfqq_sync(cfqq))
return NULL;
+ if (CFQQ_SEEKY(cfqq))
+ return NULL;

return cfqq;
}
@@ -1181,7 +1195,7 @@ static int cfqq_process_refs(struct cfq_queue *cfqq)

static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
{
- int process_refs;
+ int process_refs, new_process_refs;
struct cfq_queue *__cfqq;

/* Avoid a circular list and skip interim queue merges */
@@ -1199,8 +1213,17 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
if (process_refs == 0)
return;

- cfqq->new_cfqq = new_cfqq;
- atomic_add(process_refs, &new_cfqq->ref);
+ /*
+ * Merge in the direction of the lesser amount of work.
+ */
+ new_process_refs = cfqq_process_refs(new_cfqq);
+ if (new_process_refs >= process_refs) {
+ cfqq->new_cfqq = new_cfqq;
+ atomic_add(process_refs, &new_cfqq->ref);
+ } else {
+ new_cfqq->new_cfqq = cfqq;
+ atomic_add(new_process_refs, &cfqq->ref);
+ }
}

/*
@@ -2029,6 +2052,19 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
total = cfqq->seek_total + (cfqq->seek_samples/2);
do_div(total, cfqq->seek_samples);
cfqq->seek_mean = (sector_t)total;
+
+ /*
+ * If this cfqq is shared between multiple processes, check to
+ * make sure that those processes are still issuing I/Os within
+ * the mean seek distance. If not, it may be time to break the
+ * queues apart again.
+ */
+ if (cfq_cfqq_coop(cfqq)) {
+ if (CFQQ_SEEKY(cfqq) && !cfqq->seeky_start)
+ cfqq->seeky_start = jiffies;
+ else if (!CFQQ_SEEKY(cfqq))
+ cfqq->seeky_start = 0;
+ }
}

/*
@@ -2391,6 +2427,32 @@ cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
return cic_to_cfqq(cic, 1);
}

+static int should_split_cfqq(struct cfq_queue *cfqq)
+{
+ if (cfqq->seeky_start &&
+ time_after(jiffies, cfqq->seeky_start + CFQQ_COOP_TOUT))
+ return 1;
+ return 0;
+}
+
+/*
+ * Returns NULL if a new cfqq should be allocated, or the old cfqq if this
+ * was the last process referring to said cfqq.
+ */
+static struct cfq_queue *
+split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
+{
+ if (cfqq_process_refs(cfqq) == 1) {
+ cfqq->seeky_start = 0;
+ cfqq->pid = current->pid;
+ cfq_clear_cfqq_coop(cfqq);
+ return cfqq;
+ }
+
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(cfqq);
+ return NULL;
+}
/*
* Allocate cfq data structures associated with this request.
*/
@@ -2413,12 +2475,23 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
if (!cic)
goto queue_fail;

+new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
} else {
/*
+ * If the queue was seeky for too long, break it apart.
+ */
+ if (cfq_cfqq_coop(cfqq) && should_split_cfqq(cfqq)) {
+ cfq_log_cfqq(cfqd, cfqq, "breaking apart cfqq");
+ cfqq = split_cfqq(cic, cfqq);
+ if (!cfqq)
+ goto new_queue;
+ }
+
+ /*
* Check to see if this queue is scheduled to merge with
* another, closely cooperating queue. The merging of
* queues happens here as it must be done in process context.
--
1.6.2.5

2009-10-24 20:08:44

by Corrado Zoccolo

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

Hi Jeff,
this series looks good.
I like in particular the fact that you move seekiness detection into the cfqq.
This can help with processes that issue sequential reads and seeky
writes, or vice versa.
Probably the think time could also be made per-cfqq, so that the
decision whether we should idle for a given cfqq is more precise.

On Fri, Oct 23, 2009 at 11:14 PM, Jeff Moyer <[email protected]> wrote:
> Hi,
>
> This is a follow-up patch series to the original close cooperator
> support for CFQ. The problem is that some programs (NFSd, dump(8),
> the iSCSI target mode driver, qemu) interleave sequential I/Os between
> multiple threads or processes. The result is that CFQ's idling logic
> introduces large delays, leading to very low throughput.

You identified the problem in the idling logic, which reduces the
throughput in this particular scenario: various threads or processes
issue I/O requests (in random order) with different I/O contexts on
behalf of a single entity.
In this case, any idling between those threads is detrimental.
Ideally, such cases should already be spotted, since the think time
should be high for such processes, so I wonder if this indicates a
problem in the current think-time logic.
Can you send me your read-test, so I can investigate it?

Thanks,
Corrado

2009-10-26 11:40:09

by Jens Axboe

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

On Sat, Oct 24 2009, Corrado Zoccolo wrote:
> Hi Jeff,
> this series looks good.
> I like in particular the fact that you move seekiness detection into the cfqq.
> This can help with processes that issue sequential reads and seeky
> writes, or vice versa.
> Probably the think time could also be made per-cfqq, so that the
> decision whether we should idle for a given cfqq is more precise.
>
> On Fri, Oct 23, 2009 at 11:14 PM, Jeff Moyer <[email protected]> wrote:
> > Hi,
> >
> > This is a follow-up patch series to the original close cooperator
> > support for CFQ. The problem is that some programs (NFSd, dump(8),
> > the iSCSI target mode driver, qemu) interleave sequential I/Os between
> > multiple threads or processes. The result is that CFQ's idling logic
> > introduces large delays, leading to very low throughput.
>
> You identified the problem in the idling logic, which reduces the
> throughput in this particular scenario: various threads or processes
> issue I/O requests (in random order) with different I/O contexts on
> behalf of a single entity.
> In this case, any idling between those threads is detrimental.
> Ideally, such cases should already be spotted, since the think time
> should be high for such processes, so I wonder if this indicates a
> problem in the current think-time logic.

That isn't necessarily true; it may just as well be that there's very
little think time (I don't see the connection here). A test case to
demonstrate this would be a number of processes/threads splitting a
sequential read of a file between them.

--
Jens Axboe

2009-10-26 13:20:55

by Corrado Zoccolo

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

Hi Jens
On Mon, Oct 26, 2009 at 12:40 PM, Jens Axboe <[email protected]> wrote:
> On Sat, Oct 24 2009, Corrado Zoccolo wrote:
>> You identified the problem in the idling logic, which reduces the
>> throughput in this particular scenario: various threads or processes
>> issue I/O requests (in random order) with different I/O contexts on
>> behalf of a single entity.
>> In this case, any idling between those threads is detrimental.
>> Ideally, such cases should already be spotted, since the think time
>> should be high for such processes, so I wonder if this indicates a
>> problem in the current think-time logic.
>
> That isn't necessarily true; it may just as well be that there's very
> little think time (I don't see the connection here). A test case to
> demonstrate this would be a number of processes/threads splitting a
> sequential read of a file between them.

Jeff said that the huge performance drop was not observable with noop
or any other work-conserving scheduler. Since noop doesn't enforce any
I/O ordering, but just ensures that any I/O passes through ASAP, this
means that the biggest problem is due to idling, while the increased
seekiness has only a small impact.

So your test case doesn't actually match the observations: each thread
will always have new requests to submit (so idling doesn't penalize
too much here), while the seekiness introduced will be the most
important factor.

I think the real test case is something like the following (single dd
through NFS via UDP; see the sketch below):
* there is a single thread that submits a small number of requests
(e.g. 2) to a work queue, and waits for their completion before
submitting new requests
* there is a thread pool that executes those requests (1 thread runs 1
request) and signals back completion. Threads in the pool are
selected randomly.

In this case, the average think time should be > the average access
time as soon as the number of threads exceeds 2*#parallel_requests.
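
Concretely, something like this hypothetical sketch (invented names,
untested; a semaphore stands in for the work queue, so whichever worker
grabs a slot services the next chunk):

#include <fcntl.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>
#include <unistd.h>

#define NWORKERS	8
#define INFLIGHT	2	/* requests kept in flight */
#define CHUNK		(64 * 1024)

static int fd;
static off_t next_off;
static volatile int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t slots;		/* bounds the in-flight requests */

static void *worker(void *arg)
{
	char *buf = malloc(CHUNK);

	for (;;) {
		off_t off;

		sem_wait(&slots);	/* claim an in-flight slot */
		if (done) {
			sem_post(&slots);	/* wake the next waiter */
			break;
		}
		pthread_mutex_lock(&lock);
		off = next_off;
		next_off += CHUNK;
		pthread_mutex_unlock(&lock);

		if (pread(fd, buf, CHUNK, off) <= 0)
			done = 1;	/* EOF or error */
		sem_post(&slots);	/* completion: allow a new request */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NWORKERS];

	fd = open(argv[1], O_RDONLY);
	sem_init(&slots, 0, INFLIGHT);
	for (int i = 0; i < NWORKERS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}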

Corrado

>
> --
> Jens Axboe
>
>

2009-10-26 13:28:13

by Jens Axboe

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

On Mon, Oct 26 2009, Corrado Zoccolo wrote:
> Hi Jens
> On Mon, Oct 26, 2009 at 12:40 PM, Jens Axboe <[email protected]> wrote:
> > On Sat, Oct 24 2009, Corrado Zoccolo wrote:
> >> You identified the problem in the idling logic, which reduces the
> >> throughput in this particular scenario: various threads or processes
> >> issue I/O requests (in random order) with different I/O contexts on
> >> behalf of a single entity.
> >> In this case, any idling between those threads is detrimental.
> >> Ideally, such cases should already be spotted, since the think time
> >> should be high for such processes, so I wonder if this indicates a
> >> problem in the current think-time logic.
> >
> > That isn't necessarily true; it may just as well be that there's very
> > little think time (I don't see the connection here). A test case to
> > demonstrate this would be a number of processes/threads splitting a
> > sequential read of a file between them.
>
> Jeff said that the huge performance drop was not observable with noop
> or any other work-conserving scheduler. Since noop doesn't enforce any
> I/O ordering, but just ensures that any I/O passes through ASAP, this
> means that the biggest problem is due to idling, while the increased
> seekiness has only a small impact.

Not true, noop still does merging. And even if it didn't, if you have
queuing on the device side, things may still work out. The key is that
you actually send those requests off to the device, which the idling
will prevent for CFQ. The biggest problem is of course due to idling;
if we didn't idle between the cooperating processes, there would not be
an issue. And this is exactly what Jeff has done: merge those queues.

The test app is of course timing sensitive to some degree, since if the
threads get too far out of sync then things will go down the drain.

You could argue that decreasing the seekiness threshold would "fix"
that, but that would surely not work for other cases where an app is
mostly sequential but has to fetch metadata and such.

> So your test case doesn't actually match the observations: each thread
> will always have new requests to submit (so idling doesn't penalize
> too much here), while the seekiness introduced will be the most
> important factor.
>
> I think the real test case is something like the following (single dd
> through NFS via UDP; see the sketch below):
> * there is a single thread that submits a small number of requests
> (e.g. 2) to a work queue, and waits for their completion before
> submitting new requests
> * there is a thread pool that executes those requests (1 thread runs 1
> request) and signals back completion. Threads in the pool are
> selected randomly.

Same thing; you just get rid of the timing constraint. A test case
would ensure that as well.

--
Jens Axboe

2009-10-26 13:31:46

by Jeff Moyer

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

Corrado Zoccolo <[email protected]> writes:

> Hi Jeff,
> this series looks good.

Hi, Corrado. Thanks again for the review!

> I like in particular the fact that you move seekiness detection into the cfqq.
> This can help with processes that issue sequential reads and seeky
> writes, or vice versa.
> Probably the think time could also be made per-cfqq, so that the
> decision whether we should idle for a given cfqq is more precise.

I'll have to think about that one. It would be good to know Jens'
opinion on the matter, too.

> On Fri, Oct 23, 2009 at 11:14 PM, Jeff Moyer <[email protected]> wrote:
>> Hi,
>>
>> This is a follow-up patch series to the original close cooperator
>> support for CFQ. The problem is that some programs (NFSd, dump(8),
>> the iSCSI target mode driver, qemu) interleave sequential I/Os between
>> multiple threads or processes. The result is that CFQ's idling logic
>> introduces large delays, leading to very low throughput.
>
> You identified the problem in the idling logic, which reduces the
> throughput in this particular scenario: various threads or processes
> issue I/O requests (in random order) with different I/O contexts on
> behalf of a single entity.
> In this case, any idling between those threads is detrimental.
> Ideally, such cases should already be spotted, since the think time
> should be high for such processes, so I wonder if this indicates a
> problem in the current think-time logic.

For read-test2, the readers are not dependent upon each other. That is,
each process reads the blocks assigned to it, so they do no
"thinking", or waiting for the other processes, in between I/Os.

> Can you send me your read-test, so I can investigate it?

Sure thing. I didn't write it; it was provided by Moritoshi Oshiro to
aid in reproducing the dump issue. You can find it here:

http://people.redhat.com/jmoyer/read-test2.tar.gz

Cheers,
Jeff

2009-10-26 13:34:19

by Jens Axboe

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

On Mon, Oct 26 2009, Jeff Moyer wrote:
> Corrado Zoccolo <[email protected]> writes:
>
> > Hi Jeff,
> > this series looks good.
>
> Hi, Corrado. Thanks again for the review!
>
> > I like in particular the fact that you move seekiness detection into the cfqq.
> > This can help with processes that issue sequential reads and seeky
> > writes, or vice versa.
> > Probably the think time could also be made per-cfqq, so that the
> > decision whether we should idle for a given cfqq is more precise.
>
> I'll have to think about that one. It would be good to know Jens'
> opinion on the matter, too.

Your implementation looks fine; as usual, I'm mostly worried about
performance impact and suitability (I hate having to work around
issues). But the win is so large in some cases that we should just go
ahead and merge it for .33, so I'll queue it up.

It would be nice to fix the in-kernel problem with NFS, since that is
doable.

--
Jens Axboe

2009-10-26 15:01:47

by Jeff Moyer

Subject: Re: [PATCH/RFC 0/4] cfq: implement merging and breaking up of cfq_queues

Jens Axboe <[email protected]> writes:

> On Mon, Oct 26 2009, Jeff Moyer wrote:
>> Corrado Zoccolo <[email protected]> writes:
>>
>> > Hi Jeff,
>> > this series looks good.
>>
>> Hi, Corrado. Thanks again for the review!
>>
>> > I like in particular the fact that you move seekiness detection into the cfqq.
>> > This can help with processes that issue sequential reads and seeky
>> > writes, or vice versa.
>> > Probably the think time could also be made per-cfqq, so that the
>> > decision whether we should idle for a given cfqq is more precise.
>>
>> I'll have to think about that one. It would be good to know Jens'
>> opinion on the matter, too.
>
> Your implementation looks fine; as usual, I'm mostly worried about
> performance impact and suitability (I hate having to work around
> issues). But the win is so large in some cases that we should just go
> ahead and merge it for .33, so I'll queue it up.

Great, thanks for the review. In this case, however, I was wondering
what your opinion was about moving the think time calculation to be per
cfqq. ;-)

> It would be nice to fix the in-kernel problem with NFS, since that is
> doable.

I'll see if I can get someone motivated to work on that. I'm not sure
that I can devote much time to the issue myself, unfortunately.

Cheers,
Jeff