2010-07-22 21:30:53

by Vivek Goyal

Subject: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable


Hi,

This is V4 of the patchset which implements a new tunable group_idle and also
implements IOPS mode for group fairness. Following are changes since V3.

- Cleaned up the code a bit to make it clear that IOPS mode is effective only
for group scheduling and that cfq queue scheduling should not be affected. Note
that CFQ currently uses slightly different algorithms for cfq queue and
cfq group scheduling.

- Updated the documentation as per Christoph's comments.

What's the problem
------------------
On high-end storage (I observed this on an HP EVA storage array with 12 SATA
disks in RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and overall throughput suffers a lot.

These problems primarily originate from two things: idling on each cfq
queue, and the quantum (the limit on the number of requests dispatched from a
single queue before allowing dispatch from other queues). Once you set
slice_idle=0 and raise quantum to a higher value, most of CFQ's problems on
higher-end storage disappear.
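
For reference, both knobs live in the per-device iosched directory in sysfs,
so trying this by hand is just a couple of echoes (sdX is a placeholder below;
substitute your actual CFQ-managed device):

  echo 0  > /sys/block/sdX/queue/iosched/slice_idle   # no idling between queues
  echo 32 > /sys/block/sdX/queue/iosched/quantum      # more requests per dispatch round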

This problem also becomes visible with the IO controller, where one creates
multiple groups and gets fairness but overall throughput drops. In the
following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups with weights 100 to 800.

Kernel=2.6.35-rc6-iops+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6120 12596 16530 23408 28984 35579 42061 47335 212613
bsr 1 2 5250 10545 16604 23717 24677 29997 36753 42571 190114
bsr 1 4 4437 10372 12546 17231 26100 32241 38208 35419 176554
bsr 1 8 4636 9367 11902 18948 24589 27472 30341 37262 164517

Notice that overall throughput is just around 164MB/s with 8 sequential readers
in each group.

With this patch set, I set slice_idle=0 and re-ran the same test.

Kernel=2.6.35-rc6-iops+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6548 12174 17870 24063 29992 35695 41439 47034 214815
bsr 1 2 10299 20487 30460 39375 46812 52783 59455 64351 324022
bsr 1 4 10648 21735 32565 43442 52756 59513 64425 70324 355408
bsr 1 8 11818 24483 36779 48144 55623 62583 65478 72279 377187


Notice how overall throughput has shot up to 377MB/s while retaining the ability
to do IO control.

This patchset implements a CFQ group IOPS fairness mode: if slice_idle=0
and the storage supports NCQ, CFQ starts doing group accounting in terms
of the number of requests dispatched rather than time.

This patchset also implements a new tunable, group_idle, which allows one to
set slice_idle=0 to disable slice idling on the cfq queue and service tree but
still idle on the group, so that we can achieve better throughput for certain
workloads (sequential reads) and still achieve service differentiation among
groups.
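
As a rough sketch of how such a weighted-group configuration is put together
(the mount point, group names, weights and the sdX device below are
illustrative, not the exact test scripts used for the numbers above):

  mount -t cgroup -o blkio none /cgroup/blkio               # cgroup v1 blkio controller
  mkdir /cgroup/blkio/grp1 /cgroup/blkio/grp2
  echo 100 > /cgroup/blkio/grp1/blkio.weight                # low weight group
  echo 800 > /cgroup/blkio/grp2/blkio.weight                # high weight group
  echo 1 > /sys/block/sdX/queue/iosched/group_isolation     # stronger isolation between groups
  echo 0 > /sys/block/sdX/queue/iosched/slice_idle          # group accounting switches to IOPS on NCQ storage
  echo 8 > /sys/block/sdX/queue/iosched/group_idle          # still idle at group level for service differentiation

The readers are then moved into the groups by writing their PIDs to each
group's tasks file before starting the IO.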

If you have thoughts on other ways of solving the problem, I am all ears.

Thanks
Vivek


2010-07-22 21:30:06

by Vivek Goyal

Subject: [PATCH 2/5] cfq-iosched: Implement IOPS mode for group scheduling

o Implement another CFQ mode where we charge group in terms of number
of requests dispatched instead of measuring the time. Measuring in terms
of time is not possible when we are driving deeper queue depths and there
are requests from multiple cfq queues in the request queue.

o This mode currently gets activated if one sets slice_idle=0 and associated
disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
most of the queues will dispatch 1 or more requests and then cfq queue
expiry happens and we don't have a way to measure time. So start providing
fairness in terms of IOPS.

o Currently IOPS mode works only with cfq group scheduling. CFQ is following
different scheduling algorithms for queue and group scheduling. These IOPS
stats are used only for group scheduling, hence in non-cgroup mode nothing
should change.

o For CFQ group scheduling one can disable slice idling so that we don't idle
on queue and drive deeper request queue depths (achieving better throughput),
at the same time group idle is enabled so one should get service
differentiation among groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c5ec2eb..9f82ec6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
&cfqg->service_trees[i][j]: NULL) \


+static inline bool iops_mode(struct cfq_data *cfqd)
+{
+ /*
+ * If we are not idling on queues and it is a NCQ drive, parallel
+ * execution of requests is on and measuring time is not possible
+ * in most of the cases until and unless we drive shallower queue
+ * depths and that becomes a performance bottleneck. In such cases
+ * switch to start providing fairness in terms of number of IOs.
+ */
+ if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
+ return true;
+ else
+ return false;
+}
+
static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
{
if (cfq_class_idle(cfqq))
@@ -905,7 +920,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
return slice_used;
}

@@ -913,19 +927,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl, charge_sl;
+ unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;

BUG_ON(nr_sync < 0);
- used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
+ used_sl = charge = cfq_cfqq_slice_usage(cfqq);

- if (!cfq_cfqq_sync(cfqq) && !nr_sync)
- charge_sl = cfqq->allocated_slice;
+ if (iops_mode(cfqd))
+ charge = cfqq->slice_dispatch;
+ else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
+ charge = cfqq->allocated_slice;

/* Can't update vdisktime while group is on service tree */
cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
+ cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
__cfq_group_service_tree_add(st, cfqg);

/* This group is being expired. Save the context */
@@ -939,6 +955,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
+ used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
--
1.7.1.1

2010-07-22 21:30:10

by Vivek Goyal

Subject: [PATCH 5/5] cfq-iosched: Documentation update

o Documentation update for group_idle tunable and Group IOPS mode.
---
Documentation/block/cfq-iosched.txt | 44 ++++++++++++++++++++++++++++
Documentation/cgroups/blkio-controller.txt | 28 +++++++++++++++++
2 files changed, 72 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/cfq-iosched.txt

diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
new file mode 100644
index 0000000..6cc2151
--- /dev/null
+++ b/Documentation/block/cfq-iosched.txt
@@ -0,0 +1,44 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non zero value. That means by default we idle on
+queues/service trees. This can be very helpful on highly seeky media like
+single spindle SATA/SAS disks where we can cut down on overall number of
+seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+ioprio becomes weaker.
+
+So depending on storage and workload, it might be a useful to set slice_idle=0.
+In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
+keeping slice_idle enabled should be useful. For any configurations where
+there are multiple spindles behind single LUN (Host based hardware RAID
+controller or for storage arrays), setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide prio based time slices. Higher prio process
+gets bigger time slice and lower prio process gets smaller time slice.
+Measuring time becomes harder if storage is fast and supports NCQ and it would
+be better to dispatch multiple requests from multiple cfq queues in request
+queue at a time. In such scenario, it is not possible to measure time consumed
+by single queue accurately.
+
+What is possible though to measure number of requests dispatched from a single
+queue and also allow dispatch from multiple cfqq at the same time. This
+effectively becomes the fairness in terms of IOPS (IO operations per second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non cgroup users nothing should change.
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 48e0b21..6919d62 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -217,6 +217,7 @@ Details of cgroup files
CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------

If group_isolation=1, it provides stronger isolation between groups at the
expense of throughput. By default group_isolation is 0. In general that
@@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
and one wants stronger isolation between groups, then set group_isolation=1
but this will come at cost of reduced throughput.

+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On a faster hardware CFQ can be slow, especially with sequential workload.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to drive higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if you have created multiple
+groups and put applications in that group which are not driving enough
+IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
+on individual groups and throughput should improve.
+
What works
==========
- Currently only sync IO queues are support. All the buffered writes are
--
1.7.1.1

2010-07-22 21:30:29

by Vivek Goyal

Subject: [PATCH 4/5] cfq-iosched: Print number of sectors dispatched per cfqq slice

o Divyesh had gotten rid of this code in the past. I want to re-introduce it
as it helps me a lot during debugging.

Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Divyesh Shah <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e172fa1..147b3e8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ /* Number of sectors dispatched from queue in single dispatch round */
+ unsigned long nr_sectors;
};

/*
@@ -959,8 +961,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
- used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+ " sect=%u", used_sl, cfqq->slice_dispatch, charge,
+ iops_mode(cfqd), cfqq->nr_sectors);
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
@@ -1608,6 +1611,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1970,6 +1974,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
rq_data_dir(rq), rq_is_sync(rq));
}
--
1.7.1.1

2010-07-22 21:29:58

by Vivek Goyal

Subject: [PATCH 1/5] cfq-iosched: Do not idle on service tree if slice_idle=0

o Do not idle either on the cfq queue or the service tree if slice_idle=0. The
user does not want any queue or service tree idling. Currently, even with
slice_idle=0, we idle on the service tree.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..c5ec2eb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1838,6 +1838,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);

+ if (!cfqd->cfq_slice_idle)
+ return false;
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1878,7 +1881,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
+ if (!cfq_should_idle(cfqd, cfqq))
return;

/*
--
1.7.1.1

2010-07-22 21:30:50

by Vivek Goyal

Subject: [PATCH 3/5] cfq-iosched: Implement a tunable group_idle

o Implement a new tunable, group_idle, which allows idling on the group
instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
on the individual queues but idle on the group. This way, on fast storage,
we can get fairness between groups while overall throughput improves.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f82ec6..e172fa1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;

@@ -198,6 +199,8 @@ struct cfq_group {
struct hlist_node cfqd_node;
atomic_t ref;
#endif
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
};

/*
@@ -271,6 +274,7 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_group_idle;
unsigned int cfq_latency;
unsigned int cfq_group_isolation;

@@ -1883,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
- unsigned long sl;
+ unsigned long sl, group_idle = 0;

/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1899,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfq_should_idle(cfqd, cfqq))
- return;
+ if (!cfq_should_idle(cfqd, cfqq)) {
+ /* no queue idling. Check for group idling */
+ if (cfqd->cfq_group_idle)
+ group_idle = cfqd->cfq_group_idle;
+ else
+ return;
+ }

/*
* still active requests from this queue, don't idle
@@ -1927,13 +1936,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}

+ /* There are other queues in the group, don't do group idle */
+ if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+ return;
+
cfq_mark_cfqq_wait_request(cfqq);

- sl = cfqd->cfq_slice_idle;
+ if (group_idle)
+ sl = cfqd->cfq_group_idle;
+ else
+ sl = cfqd->cfq_slice_idle;

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+ group_idle ? 1 : 0);
}

/*
@@ -1949,6 +1966,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
+ (RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2218,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
cfqq = NULL;
goto keep_queue;
} else
- goto expire;
+ goto check_group_idle;
}

/*
@@ -2252,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
goto keep_queue;
}

+ /*
+ * If group idle is enabled and there are requests dispatched from
+ * this group, wait for requests to complete.
+ */
+check_group_idle:
+ if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+ && cfqq->cfqg->dispatched) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
@@ -3394,6 +3423,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ (RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
rq_data_dir(rq), rq_is_sync(rq));
@@ -3423,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* the queue.
*/
if (cfq_should_wait_busy(cfqd, cfqq)) {
- cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ unsigned long extend_sl = cfqd->cfq_slice_idle;
+ if (!cfqd->cfq_slice_idle)
+ extend_sl = cfqd->cfq_group_idle;
+ cfqq->slice_end = jiffies + extend_sl;
cfq_mark_cfqq_wait_busy(cfqq);
cfq_log_cfqq(cfqd, cfqq, "will busy wait");
}
@@ -3868,6 +3901,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_group_idle = cfq_group_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
@@ -3940,6 +3974,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3972,6 +4007,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3993,6 +4029,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(group_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_isolation),
__ATTR_NULL
@@ -4046,6 +4083,12 @@ static int __init cfq_init(void)
if (!cfq_slice_idle)
cfq_slice_idle = 1;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfq_group_idle)
+ cfq_group_idle = 1;
+#else
+ cfq_group_idle = 0;
+#endif
if (cfq_slab_setup())
return -ENOMEM;

--
1.7.1.1

2010-07-22 21:38:45

by Randy Dunlap

Subject: Re: [PATCH 5/5] cfq-iosched: Documentation update

On Thu, 22 Jul 2010 17:29:32 -0400 Vivek Goyal wrote:

> o Documentation update for group_idle tunable and Group IOPS mode.
> ---
> Documentation/block/cfq-iosched.txt | 44 ++++++++++++++++++++++++++++
> Documentation/cgroups/blkio-controller.txt | 28 +++++++++++++++++
> 2 files changed, 72 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/block/cfq-iosched.txt
>
> diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
> new file mode 100644
> index 0000000..6cc2151
> --- /dev/null
> +++ b/Documentation/block/cfq-iosched.txt
> @@ -0,0 +1,44 @@
> +CFQ ioscheduler tunables
> +========================
> +
> +slice_idle
> +----------
> +This specifies how long CFQ should idle for next request on certain cfq queues
> +(for sequential workloads) and service trees (for random workloads) before
> +queue is expired and CFQ selects next queue to dispatch from.
> +
> +By default slice_idle is a non zero value. That means by default we idle on

non-zero

> +queues/service trees. This can be very helpful on highly seeky media like
> +single spindle SATA/SAS disks where we can cut down on overall number of
> +seeks and see improved throughput.
> +
> +Setting slice_idle to 0 will remove all the idling on queues/service tree
> +level and one should see an overall improved throughput on faster storage
> +devices like multiple SATA/SAS disks in hardware RAID configuration. The down
> +side is that isolation provided from WRITES also goes down and notion of
> +ioprio becomes weaker.
> +
> +So depending on storage and workload, it might be a useful to set slice_idle=0.

might be useful

> +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
> +keeping slice_idle enabled should be useful. For any configurations where
> +there are multiple spindles behind single LUN (Host based hardware RAID
> +controller or for storage arrays), setting slice_idle=0 might end up in better
> +throughput and acceptable latencies.
> +
> +CFQ IOPS Mode for group scheduling
> +==================================
> +Basic CFQ design is to provide prio based time slices. Higher prio process
> +gets bigger time slice and lower prio process gets smaller time slice.

s/prio/priority/ multiple places.

> +Measuring time becomes harder if storage is fast and supports NCQ and it would
> +be better to dispatch multiple requests from multiple cfq queues in request
> +queue at a time. In such scenario, it is not possible to measure time consumed
> +by single queue accurately.
> +
> +What is possible though to measure number of requests dispatched from a single

though is to measure (?)

> +queue and also allow dispatch from multiple cfqq at the same time. This

what is cfqq? ^^^^

> +effectively becomes the fairness in terms of IOPS (IO operations per second).
> +
> +If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
> +to IOPS mode and starts providing fairness in terms of number of requests
> +dispatched. Note that this mode switching takes effect only for group
> +scheduling. For non cgroup users nothing should change.

non-cgroup

> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 48e0b21..6919d62 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -217,6 +217,7 @@ Details of cgroup files
> CFQ sysfs tunable
> =================
> /sys/block/<disk>/queue/iosched/group_isolation
> +-----------------------------------------------
>
> If group_isolation=1, it provides stronger isolation between groups at the
> expense of throughput. By default group_isolation is 0. In general that
> @@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
> and one wants stronger isolation between groups, then set group_isolation=1
> but this will come at cost of reduced throughput.
>
> +/sys/block/<disk>/queue/iosched/slice_idle
> +------------------------------------------
> +On a faster hardware CFQ can be slow, especially with sequential workload.
> +This happens because CFQ idles on a single queue and single queue might not
> +drive deeper request queue depths to keep the storage busy. In such scenarios
> +one can try setting slice_idle=0 and that would switch CFQ to IOPS
> +(IO operations per second) mode on NCQ supporting hardware.
> +
> +That means CFQ will not idle between cfq queues of a cfq group and hence be
> +able to drive higher queue depth and achieve better throughput. That also
> +means that cfq provides fairness among groups in terms of IOPS and not in
> +terms of disk time.
> +
> +/sys/block/<disk>/queue/iosched/group_idle
> +------------------------------------------
> +If one disables idling on individual cfq queues and cfq service trees by
> +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
> +on the group in an attempt to provide fairness among groups.
> +
> +By default group_idle is same as slice_idle and does not do anything if
> +slice_idle is enabled.
> +
> +One can experience an overall throughput drop if you have created multiple
> +groups and put applications in that group which are not driving enough
> +IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
> +on individual groups and throughput should improve.
> +
> What works
> ==========
> - Currently only sync IO queues are support. All the buffered writes are
> --


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2010-07-23 14:04:15

by Heinz Diehl

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On 23.07.2010, Vivek Goyal wrote:

> This is V4 of the patchset which implements a new tunable group_idle and also
> implements IOPS mode for group fairness. Following are changes since V3.
[....]

Just for information: this patchset, applied to 2.6.35-rc6, gives about
20-25% increase in speed/throughput on my desktop system
(Phenom 2.5GHz Quadcore, 3 disks) with the tunables set according
to what you've used/reported here (the setup with slice_idle set to 0),
and it's measurable with fs_mark, too.

After 2 hours of hard testing, the machine remains stable and responsive.

2010-07-23 14:13:18

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Fri, Jul 23, 2010 at 04:03:43PM +0200, Heinz Diehl wrote:
> On 23.07.2010, Vivek Goyal wrote:
>
> > This is V4 of the patchset which implements a new tunable group_idle and also
> > implements IOPS mode for group fairness. Following are changes since V3.
> [....]
>
> Just for information: this patchset, applied to 2.6.35-rc6, gives about
> 20-25% increase in speed/throughput on my desktop system
> (Phenom 2.5GHz Quadcore, 3 disks) with the tunables set according
> to what you've used/reported here (the setup with slice_idle set to 0),
> and it's measurable with fs_mark, too.
>
> After 2 hours of hard testing, the machine remains stable and responsive.

Thanks for some testing Heinz. I am assuming you are not using cgroups
and blkio controller.

In that case, you are seeing improvements probably due to first patch
where we don't idle on service tree if slice_idle=0. Hence we cut down on
overall idling and can see a throughput increase.

What kind of configuration are these 3 disks in on your system? Some hardware
RAID or software RAID ?

Thanks
Vivek

2010-07-23 14:56:58

by Heinz Diehl

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On 23.07.2010, Vivek Goyal wrote:

> Thanks for some testing Heinz. I am assuming you are not using cgroups
> and blkio controller.

Not at all.

> In that case, you are seeing improvements probably due to first patch
> where we don't idle on service tree if slice_idle=0. Hence we cut down on
> overall idling and can see a throughput increase.

Hmm, in any case it's not getting worse by setting slice_idle to 8.

My main motivation to test your patches was that I thought
the other way 'round, and was just curious on how this patchset
will affect machines which are NOT a high end server/storage system :-)

> What kind of configuration are these 3 disks in on your system? Some hardware
> RAID or software RAID ?

Just 3 SATA disks plugged into the onboard controller, no RAID or whatsoever.

I used fs_mark for testing:
"fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 1 -w 4096 -F"

These are the results with plain cfq (2.6.35-rc6) and the settings which
gave the best speed/throughput on my machine:

low_latency = 0
slice_idle = 4
quantum = 32

Setting slice_idle to 0 didn't improve anything, I tried this before.

FSUse% Count Size Files/sec App Overhead
27 1000 65536 360.3 34133
27 2000 65536 384.4 34657
27 3000 65536 401.1 32994
27 4000 65536 394.3 33781
27 5000 65536 406.8 32569
27 6000 65536 401.9 34001
27 7000 65536 374.5 33192
27 8000 65536 398.3 32839
27 9000 65536 405.2 34110
27 10000 65536 398.9 33887
27 11000 65536 402.3 34111
27 12000 65536 398.1 33652
27 13000 65536 412.9 32443
27 14000 65536 408.1 32197


And this is after applying your patchset, with your settings
(and slice_idle = 0):

FSUse% Count Size Files/sec App Overhead
27 1000 65536 600.7 29579
27 2000 65536 568.4 30650
27 3000 65536 522.0 29171
27 4000 65536 534.1 29751
27 5000 65536 550.7 30168
27 6000 65536 521.7 30158
27 7000 65536 493.3 29211
27 8000 65536 495.3 30183
27 9000 65536 587.8 29881
27 10000 65536 469.9 29602
27 11000 65536 482.7 29557
27 12000 65536 486.6 30700
27 13000 65536 516.1 30243


There's some 2-3% further improvement on my system with these settings,
which after some fiddling turned out to give the most performance here
(don't need the group settings, of course):

group_idle = 0
group_isolation = 0
low_latency = 1
quantum = 8
slice_idle = 8

Thanks,
Heinz.

2010-07-23 18:37:35

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Fri, Jul 23, 2010 at 04:56:31PM +0200, Heinz Diehl wrote:
> On 23.07.2010, Vivek Goyal wrote:
>
> > Thanks for some testing Heinz. I am assuming you are not using cgroups
> > and blkio controller.
>
> Not at all.
>
> > In that case, you are seeing improvements probably due to first patch
> > where we don't idle on service tree if slice_idle=0. Hence we cut down on
> > overall idling and can see a throughput increase.
>
> Hmm, in any case it's not getting worse by setting slice_idle to 8.
>
> My main motivation to test your patches was that I thought
> the other way 'round, and was just curious on how this patchset
> will affect machines which are NOT a high end server/storage system :-)
>
> > What kind of configuration are these 3 disks in on your system? Some hardware
> > RAID or software RAID ?
>
> Just 3 SATA disks plugged into the onboard controller, no RAID or whatsoever.
>
> I used fs_mark for testing:
> "fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 1 -w 4096 -F"
>
> These are the results with plain cfq (2.6.35-rc6) and the settings which
> gave the best speed/throughput on my machine:
>
> low_latency = 0
> slice_idle = 4
> quantum = 32
>
> Setting slice_idle to 0 didn't improve anything, I tried this before.
>
> FSUse% Count Size Files/sec App Overhead
> 27 1000 65536 360.3 34133
> 27 2000 65536 384.4 34657
> 27 3000 65536 401.1 32994
> 27 4000 65536 394.3 33781
> 27 5000 65536 406.8 32569
> 27 6000 65536 401.9 34001
> 27 7000 65536 374.5 33192
> 27 8000 65536 398.3 32839
> 27 9000 65536 405.2 34110
> 27 10000 65536 398.9 33887
> 27 11000 65536 402.3 34111
> 27 12000 65536 398.1 33652
> 27 13000 65536 412.9 32443
> 27 14000 65536 408.1 32197
>
>
> And this is after applying your patchset, with your settings
> (and slice_idle = 0):
>
> FSUse% Count Size Files/sec App Overhead
> 27 1000 65536 600.7 29579
> 27 2000 65536 568.4 30650
> 27 3000 65536 522.0 29171
> 27 4000 65536 534.1 29751
> 27 5000 65536 550.7 30168
> 27 6000 65536 521.7 30158
> 27 7000 65536 493.3 29211
> 27 8000 65536 495.3 30183
> 27 9000 65536 587.8 29881
> 27 10000 65536 469.9 29602
> 27 11000 65536 482.7 29557
> 27 12000 65536 486.6 30700
> 27 13000 65536 516.1 30243
>

I think the above improvement is due to the first patch and the changes in
cfq_should_idle(). cfq_should_idle() used to return 1 even if slice_idle=0,
and that created bottlenecks in some places; for example, in select_queue() we
would not expire a queue till a request from that queue completed. This
stopped a new queue from dispatching requests, etc.

Anyway, for fs_mark problem, can you give following patch a try.

https://patchwork.kernel.org/patch/113061/

Above patch should improve your fs_mark numbers even without setting
slice_idle=0.

Thanks
Vivek

2010-07-23 20:23:11

by Vivek Goyal

Subject: Re: [PATCH 5/5] cfq-iosched: Documentation update

On Thu, Jul 22, 2010 at 02:36:59PM -0700, Randy Dunlap wrote:
> On Thu, 22 Jul 2010 17:29:32 -0400 Vivek Goyal wrote:
>
> > o Documentation update for group_idle tunable and Group IOPS mode.
> > ---

Thanks Randy. I have taken care of your comments in the attached patch.

Vivek

---
Documentation/block/cfq-iosched.txt | 45 +++++++++++++++++++++++++++++
Documentation/cgroups/blkio-controller.txt | 28 ++++++++++++++++++
2 files changed, 73 insertions(+)

Index: linux-2.6/Documentation/block/cfq-iosched.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/block/cfq-iosched.txt 2010-07-23 16:20:52.000000000 -0400
@@ -0,0 +1,45 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non-zero value. That means by default we idle on
+queues/service trees. This can be very helpful on highly seeky media like
+single spindle SATA/SAS disks where we can cut down on overall number of
+seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+IO priority becomes weaker.
+
+So depending on storage and workload, it might be useful to set slice_idle=0.
+In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
+keeping slice_idle enabled should be useful. For any configurations where
+there are multiple spindles behind single LUN (Host based hardware RAID
+controller or for storage arrays), setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide priority based time slices. Higher priority
+process gets bigger time slice and lower priority process gets smaller time
+slice. Measuring time becomes harder if storage is fast and supports NCQ and
+it would be better to dispatch multiple requests from multiple cfq queues in
+request queue at a time. In such scenario, it is not possible to measure time
+consumed by single queue accurately.
+
+What is possible though is to measure number of requests dispatched from a
+single queue and also allow dispatch from multiple cfq queue at the same time.
+This effectively becomes the fairness in terms of IOPS (IO operations per
+second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non-cgroup users nothing should change.
Index: linux-2.6/Documentation/cgroups/blkio-controller.txt
===================================================================
--- linux-2.6.orig/Documentation/cgroups/blkio-controller.txt 2010-07-22 16:52:22.000000000 -0400
+++ linux-2.6/Documentation/cgroups/blkio-controller.txt 2010-07-23 16:16:09.000000000 -0400
@@ -217,6 +217,7 @@ Details of cgroup files
CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------

If group_isolation=1, it provides stronger isolation between groups at the
expense of throughput. By default group_isolation is 0. In general that
@@ -243,6 +244,33 @@ By default one should run with group_iso
and one wants stronger isolation between groups, then set group_isolation=1
but this will come at cost of reduced throughput.

+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On a faster hardware CFQ can be slow, especially with sequential workload.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to drive higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if you have created multiple
+groups and put applications in that group which are not driving enough
+IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
+on individual groups and throughput should improve.
+
What works
==========
- Currently only sync IO queues are support. All the buffered writes are

2010-07-24 08:06:27

by Heinz Diehl

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On 23.07.2010, Vivek Goyal wrote:

> Anyway, for fs_mark problem, can you give following patch a try.
> https://patchwork.kernel.org/patch/113061/

Ported it to 2.6.35-rc6, and these are my results using the same fs_mark
call as before:

slice_idle = 0

FSUse% Count Size Files/sec App Overhead
28 1000 65536 241.6 39574
28 2000 65536 231.1 39939
28 3000 65536 230.4 39722
28 4000 65536 243.2 39646
28 5000 65536 227.0 39892
28 6000 65536 224.1 39555
28 7000 65536 228.2 39761
28 8000 65536 235.3 39766
28 9000 65536 237.3 40518
28 10000 65536 225.7 39861
28 11000 65536 227.2 39441


slice_idle = 8

FSUse% Count Size Files/sec App Overhead
28 1000 65536 502.2 30545
28 2000 65536 407.6 29406
28 3000 65536 381.8 30152
28 4000 65536 438.1 30038
28 5000 65536 447.5 30477
28 6000 65536 422.0 29610
28 7000 65536 383.1 30327
28 8000 65536 415.3 30102
28 9000 65536 397.6 31013
28 10000 65536 401.4 29201
28 11000 65536 408.8 29720
28 12000 65536 391.2 29157

Huh...there's quite a difference! It's definitely the slice_idle settings
which affect the results here. Besides, this patch gives noticeably bad
desktop interactivity on my system.

Don't know if this is related, but I'm not quite sure if XFS (which I use
exclusively) uses the jbd/jbd2 journaling layer at all.

Thanks,
Heinz.

2010-07-26 13:43:48

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Sat, Jul 24, 2010 at 10:06:13AM +0200, Heinz Diehl wrote:
> On 23.07.2010, Vivek Goyal wrote:
>
> > Anyway, for fs_mark problem, can you give following patch a try.
> > https://patchwork.kernel.org/patch/113061/
>
> Ported it to 2.6.35-rc6, and these are my results using the same fs_mark
> call as before:
>
> slice_idle = 0
>
> FSUse% Count Size Files/sec App Overhead
> 28 1000 65536 241.6 39574
> 28 2000 65536 231.1 39939
> 28 3000 65536 230.4 39722
> 28 4000 65536 243.2 39646
> 28 5000 65536 227.0 39892
> 28 6000 65536 224.1 39555
> 28 7000 65536 228.2 39761
> 28 8000 65536 235.3 39766
> 28 9000 65536 237.3 40518
> 28 10000 65536 225.7 39861
> 28 11000 65536 227.2 39441
>
>
> slice_idle = 8
>
> FSUse% Count Size Files/sec App Overhead
> 28 1000 65536 502.2 30545
> 28 2000 65536 407.6 29406
> 28 3000 65536 381.8 30152
> 28 4000 65536 438.1 30038
> 28 5000 65536 447.5 30477
> 28 6000 65536 422.0 29610
> 28 7000 65536 383.1 30327
> 28 8000 65536 415.3 30102
> 28 9000 65536 397.6 31013
> 28 10000 65536 401.4 29201
> 28 11000 65536 408.8 29720
> 28 12000 65536 391.2 29157
>
> Huh...there's quite a difference! It's definitely the slice_idle settings
> which affect the results here.

In this case it is not slice_idle. This patch puts both the fsync writer and
the jbd thread on the same service tree. That way, once the fsync writer is
done there is no idling after that, and the jbd thread almost immediately gets
to dispatch requests to the disk; hence we see improved throughput.

> Besides, this patch gives noticeably bad desktop interactivity on my system.
>

How do you measure it? IOW, are you running something else also on the
desktop in the background. Like a heavy writer etc and then measuring
how interactive desktop feels?

> Don't know if this is related, but I'm not quite sure if XFS (which I use
> exclusively) uses the jbd/jbd2 journaling layer at all.

I also don't know. But because this patch is making a difference with your
XFS file system performance, maybe it does use it.

CCing Christoph, he can tell us.

Thanks
Vivek

2010-07-26 13:48:21

by Christoph Hellwig

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Mon, Jul 26, 2010 at 09:43:29AM -0400, Vivek Goyal wrote:
> > Don't know if this is related, but I'm not quite sure if XFS (which I use
> > exclusively) uses the jbd/jbd2 journaling layer at all.
>
> I also don't know. But because this patch is making a difference with your
> > XFS file system performance, maybe it does use it.
>
> CCing Christoph, he can tell us.

No, of course XFS doesn't use jbd.

2010-07-26 13:54:25

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Mon, Jul 26, 2010 at 09:48:18AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 26, 2010 at 09:43:29AM -0400, Vivek Goyal wrote:
> > > Don't know if this is related, but I'm not quite sure if XFS (which I use
> > > exclusively) uses the jbd/jbd2 journaling layer at all.
> >
> > I also don't know. But because this patch is making a difference with your
> > > XFS file system performance, maybe it does use it.
> >
> > CCing Christoph, he can tell us.
>
> No, of course XFS doesn't use jbd.

Hmm.., interesting. So somewhere WRITE_SYNC idling in CFQ is hurting XFS
performance also. This time for some other reason and not jbd/jbd2.

Vivek

2010-07-26 14:13:34

by Christoph Hellwig

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

Just curious, what numbers do you see when simply using the deadline
I/O scheduler? That's what we recommend for use with XFS anyway.

2010-07-26 16:15:17

by Heinz Diehl

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On 26.07.2010, Vivek Goyal wrote:

> How do you measure it? IOW, are you running something else also on the
> desktop in the background. Like a heavy writer etc and then measuring
> how interactive desktop feels?

I used Linus' "bigfile torture test" in the background:

while : ; do time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ;
sync; rm bigfile"; done

and Theodore Tso's "fsync-tester" to benchmark interactivity.

Didn't save any results, wasn't expecting that this could be of any
further interest (but can run the tests one more time, if desired).

Thanks,
Heinz.

2010-07-27 05:50:20

by Gui, Jianfeng/归 剑峰

Subject: Re: [PATCH 2/5] cfq-iosched: Implement IOPS mode for group scheduling

Vivek Goyal wrote:
> o Implement another CFQ mode where we charge group in terms of number
> of requests dispatched instead of measuring the time. Measuring in terms
> of time is not possible when we are driving deeper queue depths and there
> are requests from multiple cfq queues in the request queue.
>
> o This mode currently gets activated if one sets slice_idle=0 and associated
> disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
> most of the queues will dispatch 1 or more requests and then cfq queue
> expiry happens and we don't have a way to measure time. So start providing
> fairness in terms of IOPS.
>
> o Currently IOPS mode works only with cfq group scheduling. CFQ is following
> different scheduling algorithms for queue and group scheduling. These IOPS
> stats are used only for group scheduling, hence in non-cgroup mode nothing
> should change.
>
> o For CFQ group scheduling one can disable slice idling so that we don't idle
> on queue and drive deeper request queue depths (achieving better throughput),
> at the same time group idle is enabled so one should get service
> differentiation among groups.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 30 ++++++++++++++++++++++++------
> 1 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c5ec2eb..9f82ec6 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
> &cfqg->service_trees[i][j]: NULL) \
>
>
> +static inline bool iops_mode(struct cfq_data *cfqd)
> +{
> + /*
> + * If we are not idling on queues and it is a NCQ drive, parallel
> + * execution of requests is on and measuring time is not possible
> + * in most of the cases until and unless we drive shallower queue
> + * depths and that becomes a performance bottleneck. In such cases
> + * switch to start providing fairness in terms of number of IOs.
> + */
> + if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
> + return true;
> + else
> + return false;
> +}
> +
> static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
> {
> if (cfq_class_idle(cfqq))
> @@ -905,7 +920,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> slice_used = cfqq->allocated_slice;
> }
>
> - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> return slice_used;
> }
>
> @@ -913,19 +927,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> struct cfq_queue *cfqq)
> {
> struct cfq_rb_root *st = &cfqd->grp_service_tree;
> - unsigned int used_sl, charge_sl;
> + unsigned int used_sl, charge;
> int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
> - cfqg->service_tree_idle.count;
>
> BUG_ON(nr_sync < 0);
> - used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
> + used_sl = charge = cfq_cfqq_slice_usage(cfqq);
>
> - if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> - charge_sl = cfqq->allocated_slice;
> + if (iops_mode(cfqd))
> + charge = cfqq->slice_dispatch;

Hi Vivek,

At this time, requests may still stay in the dispatch list. Shall we add a new
variable in cfqq to keep track of the number of requests that go into the
driver, and charge this number?

Thanks
Gui

> + else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> + charge = cfqq->allocated_slice;
>
> /* Can't update vdisktime while group is on service tree */
> cfq_rb_erase(&cfqg->rb_node, st);
> - cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
> + cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
> __cfq_group_service_tree_add(st, cfqg);
>
> /* This group is being expired. Save the context */
> @@ -939,6 +955,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>
> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
> st->min_vdisktime);
> + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
> + used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
> cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
> cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
> }

2010-07-27 07:49:29

by Heinz Diehl

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On 26.07.2010, Christoph Hellwig wrote:

> Just curious, what numbers do you see when simply using the deadline
> I/O scheduler? That's what we recommend for use with XFS anyway.

Some fs_mark testing first:

Deadline, 1 thread:

# ./fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 1 -w 4096 -F

FSUse% Count Size Files/sec App Overhead
26 1000 65536 227.7 39998
26 2000 65536 229.2 39309
26 3000 65536 236.4 40232
26 4000 65536 231.1 39294
26 5000 65536 233.4 39728
26 6000 65536 234.2 39719
26 7000 65536 227.9 39463
26 8000 65536 239.0 39477
26 9000 65536 233.1 39563
26 10000 65536 233.1 39878
26 11000 65536 233.2 39560

Deadline, 4 threads:

# ./fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 4 -w 4096 -F

FSUse% Count Size Files/sec App Overhead
26 4000 65536 465.6 148470
26 8000 65536 398.6 152827
26 12000 65536 472.7 147235
26 16000 65536 477.0 149344
27 20000 65536 489.7 148055
27 24000 65536 444.3 152806
27 28000 65536 515.5 144821
27 32000 65536 501.0 146561
27 36000 65536 456.8 150124
27 40000 65536 427.8 148830
27 44000 65536 489.6 149843
27 48000 65536 467.8 147501


CFQ, 1 thread:

# ./fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 1 -w 4096 -F

FSUse% Count Size Files/sec App Overhead
27 1000 65536 439.3 30158
27 2000 65536 457.7 30274
27 3000 65536 432.0 30572
27 4000 65536 413.9 29641
27 5000 65536 410.4 30289
27 6000 65536 458.5 29861
27 7000 65536 441.1 30268
27 8000 65536 459.3 28900
27 9000 65536 420.1 30439
27 10000 65536 426.1 30628
27 11000 65536 479.7 30058

CFQ, 4 threads:

# ./fs_mark -S 1 -D 10000 -N 100000 -d /home/htd/fsmark/test -s 65536 -t 4 -w 4096 -F

FSUse% Count Size Files/sec App Overhead
27 4000 65536 540.7 149177
27 8000 65536 469.6 147957
27 12000 65536 507.6 149185
27 16000 65536 460.0 145953
28 20000 65536 534.3 151936
28 24000 65536 542.1 147083
28 28000 65536 516.0 149363
28 32000 65536 534.3 148655
28 36000 65536 511.1 146989
28 40000 65536 499.9 147884
28 44000 65536 514.3 147846
28 48000 65536 467.1 148099
28 52000 65536 454.7 149052


Here are the results of the fsync-tester, doing

"while : ; do time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ;
sync; rm bigfile"; done"

in the background on the root fs and running fsync-tester on /home.

Deadline:

liesel:~/test # ./fsync-tester
fsync time: 7.7866
fsync time: 9.5638
fsync time: 5.8163
fsync time: 5.5412
fsync time: 5.2630
fsync time: 8.6688
fsync time: 3.9947
fsync time: 5.4753
fsync time: 14.7666
fsync time: 4.0060
fsync time: 3.9231
fsync time: 4.0635
fsync time: 1.6129
^C

CFQ:

liesel:~/test # ./fsync-tester
fsync time: 0.2457
fsync time: 0.3045
fsync time: 0.1980
fsync time: 0.2011
fsync time: 0.1941
fsync time: 0.2580
fsync time: 0.2041
fsync time: 0.2671
fsync time: 0.0320
fsync time: 0.2372
^C

The same setup here, running both the "bigfile torture test" and
fsync-tester on /home:

Deadline:

htd@liesel:~/fs> ./fsync-tester
fsync time: 11.0455
fsync time: 18.3555
fsync time: 6.8022
fsync time: 14.2020
fsync time: 9.4786
fsync time: 10.3002
fsync time: 7.2607
fsync time: 8.2169
fsync time: 3.7805
fsync time: 7.0325
fsync time: 12.0827
^C


CFQ:
htd@liesel:~/fs> ./fsync-tester
fsync time: 13.1126
fsync time: 4.9432
fsync time: 4.7833
fsync time: 0.2117
fsync time: 0.0167
fsync time: 14.6472
fsync time: 10.7527
fsync time: 4.3230
fsync time: 0.0151
fsync time: 15.1668
fsync time: 10.7662
fsync time: 0.1670
fsync time: 0.0156
^C

All partitions are XFS formatted using

mkfs.xfs -f -l lazy-count=1,version=2 -i attr=2 -d agcount=4

and mounted that way:

(rw,noatime,logbsize=256k,logbufs=2,nobarrier)

Kernel is 2.6.35-rc6.


Thanks, Heinz.

2010-07-27 13:09:17

by Vivek Goyal

Subject: Re: [PATCH 2/5] cfq-iosched: Implement IOPS mode for group scheduling

On Tue, Jul 27, 2010 at 01:47:39PM +0800, Gui Jianfeng wrote:

[..]
> > @@ -913,19 +927,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> > struct cfq_queue *cfqq)
> > {
> > struct cfq_rb_root *st = &cfqd->grp_service_tree;
> > - unsigned int used_sl, charge_sl;
> > + unsigned int used_sl, charge;
> > int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
> > - cfqg->service_tree_idle.count;
> >
> > BUG_ON(nr_sync < 0);
> > - used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
> > + used_sl = charge = cfq_cfqq_slice_usage(cfqq);
> >
> > - if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> > - charge_sl = cfqq->allocated_slice;
> > + if (iops_mode(cfqd))
> > + charge = cfqq->slice_dispatch;
>
> Hi Vivek,
>
> At this time, requests may still stay in the dispatch list. Shall we add a new
> variable in cfqq to keep track of the number of requests that go into the
> driver, and charge this number?
>

Hi Gui,

How does that help? Even if a request is in the dispatch list, sooner or later
it will be dispatched. As long as we can make sure that requests in the
dispatch list are in proportion to group weights, things should be just
fine.

Thanks
Vivek

2010-07-28 20:22:29

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Sat, Jul 24, 2010 at 10:06:13AM +0200, Heinz Diehl wrote:
> On 23.07.2010, Vivek Goyal wrote:
>
> > Anyway, for fs_mark problem, can you give following patch a try.
> > https://patchwork.kernel.org/patch/113061/
>
> Ported it to 2.6.35-rc6, and these are my results using the same fs_mark
> call as before:
>
> slice_idle = 0
>
> FSUse% Count Size Files/sec App Overhead
> 28 1000 65536 241.6 39574
> 28 2000 65536 231.1 39939
> 28 3000 65536 230.4 39722
> 28 4000 65536 243.2 39646
> 28 5000 65536 227.0 39892
> 28 6000 65536 224.1 39555
> 28 7000 65536 228.2 39761
> 28 8000 65536 235.3 39766
> 28 9000 65536 237.3 40518
> 28 10000 65536 225.7 39861
> 28 11000 65536 227.2 39441
>
>
> slice_idle = 8
>
> FSUse% Count Size Files/sec App Overhead
> 28 1000 65536 502.2 30545
> 28 2000 65536 407.6 29406
> 28 3000 65536 381.8 30152
> 28 4000 65536 438.1 30038
> 28 5000 65536 447.5 30477
> 28 6000 65536 422.0 29610
> 28 7000 65536 383.1 30327
> 28 8000 65536 415.3 30102
> 28 9000 65536 397.6 31013
> 28 10000 65536 401.4 29201
> 28 11000 65536 408.8 29720
> 28 12000 65536 391.2 29157
>
> Huh...there's quite a difference! It's definitely the slice_idle settings
> which affect the results here.


> Besides, this patch gives noticeably bad desktop interactivity on my system.

Heinz,

I also ran the Linus torture test and fsync-tester on an ext3 file system on my
SATA disk, and with Corrado's fsync patch applied I in fact see better
results.

2.6.35-rc6 kernel
=================
fsync time: 1.2109
fsync time: 2.7531
fsync time: 1.3770
fsync time: 2.0839
fsync time: 1.4243
fsync time: 1.3211
fsync time: 1.1672
fsync time: 2.8345
fsync time: 1.4798
fsync time: 0.0170
fsync time: 0.0199
fsync time: 0.0204
fsync time: 0.2794
fsync time: 1.3525
fsync time: 2.2679
fsync time: 1.4629
fsync time: 1.5234
fsync time: 1.5693
fsync time: 1.7263
fsync time: 3.5739
fsync time: 1.4114
fsync time: 1.5517
fsync time: 1.5675
fsync time: 1.3818
fsync time: 1.8127
fsync time: 1.6394

2.6.35-rc6-fsync
================
fsync time: 3.8638
fsync time: 0.1209
fsync time: 2.3390
fsync time: 3.1501
fsync time: 0.1348
fsync time: 0.0879
fsync time: 1.0642
fsync time: 0.2153
fsync time: 0.1166
fsync time: 0.2744
fsync time: 0.1227
fsync time: 0.2072
fsync time: 0.0666
fsync time: 0.1818
fsync time: 0.2170
fsync time: 0.1814
fsync time: 0.0501
fsync time: 0.0198
fsync time: 0.1950
fsync time: 0.2099
fsync time: 0.0877
fsync time: 0.8291
fsync time: 0.0821
fsync time: 0.0777
fsync time: 0.0258
fsync time: 0.0574
fsync time: 0.1152
fsync time: 1.1466
fsync time: 0.2349
fsync time: 0.9589
fsync time: 1.1013
fsync time: 0.1681
fsync time: 0.0902
fsync time: 0.2052
fsync time: 0.0673

I also did "time firefox &" testing to see how long firefox takes to
launch when linus torture test is running and without patch it took
around 20 seconds and with patch it took around 17 seconds.

So to me above test results suggest that this patch does not worsen
the performance. In fact it helps. (at least on ext3 file system.)

Not sure why you are seeing different results with XFS.

Thanks
Vivek

2010-07-28 23:57:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable

On Wed, Jul 28, 2010 at 04:22:12PM -0400, Vivek Goyal wrote:
> I also did "time firefox &" testing to see how long firefox takes to
> launch when linus torture test is running and without patch it took
> around 20 seconds and with patch it took around 17 seconds.
>
> So to me above test results suggest that this patch does not worsen
> the performance. In fact it helps. (at least on ext3 file system.)
>
> Not sure why you are seeing different results with XFS.

So why didn't you test it with XFS to verify his results? We all know
that different filesystems have different I/O patterns, and we have
a history of really nasty regressions in one filesystem caused by well-meaning
changes to the I/O scheduler.

ext3 in fact is a particularly bad test case as it not only doesn't have
I/O barriers enabled, but also has particularly bad I/O patterns
compared to modern filesystems.

2010-07-29 04:35:09

by Vivek Goyal

[permalink] [raw]
Subject: cfq fsync patch testing results (Was: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable)

On Wed, Jul 28, 2010 at 07:57:16PM -0400, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 04:22:12PM -0400, Vivek Goyal wrote:
> > I also did "time firefox &" testing to see how long firefox takes to
> > launch when linus torture test is running and without patch it took
> > around 20 seconds and with patch it took around 17 seconds.
> >
> > So to me above test results suggest that this patch does not worsen
> > the performance. In fact it helps. (at least on ext3 file system.)
> >
> > Not sure why you are seeing different results with XFS.
>
> So why didn't you test it with XFS to verify his results?

Just got a little lazy. Find the testing results with ext3, ext4 and
xfs below.

> We all know
> that different filesystems have different I/O patterns, and we have
> a history of really nasty regressions in one filesystem caused by well-meaning
> changes to the I/O scheduler.
>
> ext3 in fact is a particularly bad test case as it not only doesn't have
> I/O barriers enabled, but also has particularly bad I/O patterns
> compared to modern filesystems.

Ext3 results
============
ext3 (2.6.35-rc6) ext3 (35-rc6-fsync)
----------------- -------------------
fsync time: 3.4173 fsync time: 0.0171
fsync time: 0.8831 fsync time: 0.0951
fsync time: 0.6985 fsync time: 0.0848
fsync time: 8.9449 fsync time: 0.1206
fsync time: 4.3075 fsync time: 0.4150
fsync time: 6.0146 fsync time: 0.0856
fsync time: 9.7134 fsync time: 0.1151
fsync time: 9.2247 fsync time: 0.1083
fsync time: 6.5061 fsync time: 0.1218
fsync time: 6.1862 fsync time: 4.1666
fsync time: 6.1136 fsync time: 0.1075
fsync time: 3.3593 fsync time: 0.3442
fsync time: 4.3309 fsync time: 0.1062
fsync time: 2.3596 fsync time: 2.8502
fsync time: 0.0151 fsync time: 0.0433
fsync time: 0.0180 fsync time: 4.0526
fsync time: 0.3685 fsync time: 0.1819
fsync time: 2.7396 fsync time: 0.1479
fsync time: 3.1537 fsync time: 0.1480
fsync time: 2.4474 fsync time: 0.1715
fsync time: 2.7085 fsync time: 0.0079
fsync time: 3.1629 fsync time: 0.0181
fsync time: 2.9186 fsync time: 0.0134

XFS results
==========
XFS (2.6.35-rc6) XFS (with fsync patch)
fsync time: 5.0746 fsync time: 1.8025
fsync time: 3.0057 fsync time: 2.3392
fsync time: 3.0960 fsync time: 2.2810
fsync time: 2.8392 fsync time: 2.2894
fsync time: 2.4901 fsync time: 2.3059
fsync time: 2.3151 fsync time: 2.3061
fsync time: 2.3066 fsync time: 2.9825
fsync time: 0.6608 fsync time: 2.3144
fsync time: 0.0595 fsync time: 2.2894
fsync time: 2.0977 fsync time: 0.0508
fsync time: 2.3236 fsync time: 2.3396
fsync time: 2.3229 fsync time: 2.3310
fsync time: 2.3065 fsync time: 2.3061
fsync time: 2.3234 fsync time: 2.3060
fsync time: 2.3150 fsync time: 2.3561
fsync time: 2.3149 fsync time: 2.3313
fsync time: 2.3234 fsync time: 2.0221
fsync time: 2.3066 fsync time: 2.2891
fsync time: 2.3232 fsync time: 2.3144
fsync time: 2.3317 fsync time: 2.3144
fsync time: 2.3321 fsync time: 2.2894
fsync time: 2.3232 fsync time: 2.3228
fsync time: 0.0514 fsync time: 2.3144
fsync time: 2.2480 fsync time: 0.0506

Ext4
====
ext4 (vanilla) ext4 (patched)
fsync time: 3.4080 fsync time: 2.9109
fsync time: 17.8330 fsync time: 25.0503
fsync time: 0.0922 fsync time: 2.5495
fsync time: 0.0710 fsync time: 0.0943
fsync time: 19.7977 fsync time: 0.0770
fsync time: 20.6592 fsync time: 16.3287
fsync time: 0.1020 fsync time: 24.4983
fsync time: 0.0689 fsync time: 0.1006
fsync time: 19.9981 fsync time: 0.0783
fsync time: 20.6605 fsync time: 19.1181
fsync time: 0.0930 fsync time: 22.0860
fsync time: 0.0776 fsync time: 0.0909


Notes:
======
- Above results are with and without Corrado's fsync issue patch. We
happen to be discussing it in a different thread though, hence
calling it out explicitly here.

- I am running the Linus torture test and also running Ted Ts'o's fsync-tester
to monitor fsync latencies.

- Looks like ext3 fsync times have improved.
- XFS fsync times have remained unchanged.
- ext4 fsync times seem to have gone up a bit.

I used default mount options, so I am assuming the high fsync times of ext4
come from the fact that barriers must be enabled by default. Will do
some blktracing on the ext4 case tomorrow; otherwise I think this patch
looks good.
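
(The blktracing referred to would be along these lines; the device name is a
placeholder:)

# trace the ext4 device while the torture test + fsync-tester run, then decode
blktrace -d /dev/sdX -o ext4-fsync -w 120
blkparse -i ext4-fsync | less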

Thanks
Vivek

2010-07-29 14:56:52

by Vivek Goyal

[permalink] [raw]
Subject: Re: cfq fsync patch testing results (Was: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable)

On Thu, Jul 29, 2010 at 12:34:43AM -0400, Vivek Goyal wrote:
> On Wed, Jul 28, 2010 at 07:57:16PM -0400, Christoph Hellwig wrote:
> > On Wed, Jul 28, 2010 at 04:22:12PM -0400, Vivek Goyal wrote:
> > > I also did "time firefox &" testing to see how long firefox takes to
> > > launch when linus torture test is running and without patch it took
> > > around 20 seconds and with patch it took around 17 seconds.
> > >
> > > So to me above test results suggest that this patch does not worsen
> > > the performance. In fact it helps. (at least on ext3 file system.)
> > >
> > > Not sure why you are seeing different results with XFS.
> >
> > So why didn't you test it with XFS to verify his results?
>
> Just got a little lazy. Find the testing results with ext3, ext4 and
> xfs below.
>
> > We all know
> > that different filesystems have different I/O patterns, and we have
> > a history of really nasty regressions in one filesystem caused by well-meaning
> > changes to the I/O scheduler.
> >
> > ext3 in fact is a particularly bad test case as it not only doesn't have
> > I/O barriers enabled, but also has particularly bad I/O patterns
> > compared to modern filesystems.
>
> Ext3 results
> ============
> ext3 (2.6.35-rc6) ext3 (35-rc6-fsync)
> ----------------- -------------------
> fsync time: 3.4173 fsync time: 0.0171
> fsync time: 0.8831 fsync time: 0.0951
> fsync time: 0.6985 fsync time: 0.0848
> fsync time: 8.9449 fsync time: 0.1206
> fsync time: 4.3075 fsync time: 0.4150
> fsync time: 6.0146 fsync time: 0.0856
> fsync time: 9.7134 fsync time: 0.1151
> fsync time: 9.2247 fsync time: 0.1083
> fsync time: 6.5061 fsync time: 0.1218
> fsync time: 6.1862 fsync time: 4.1666
> fsync time: 6.1136 fsync time: 0.1075
> fsync time: 3.3593 fsync time: 0.3442
> fsync time: 4.3309 fsync time: 0.1062
> fsync time: 2.3596 fsync time: 2.8502
> fsync time: 0.0151 fsync time: 0.0433
> fsync time: 0.0180 fsync time: 4.0526
> fsync time: 0.3685 fsync time: 0.1819
> fsync time: 2.7396 fsync time: 0.1479
> fsync time: 3.1537 fsync time: 0.1480
> fsync time: 2.4474 fsync time: 0.1715
> fsync time: 2.7085 fsync time: 0.0079
> fsync time: 3.1629 fsync time: 0.0181
> fsync time: 2.9186 fsync time: 0.0134
>
> XFS results
> ==========
> XFS (2.6.35-rc6) XFS (with fsync patch)
> fsync time: 5.0746 fsync time: 1.8025
> fsync time: 3.0057 fsync time: 2.3392
> fsync time: 3.0960 fsync time: 2.2810
> fsync time: 2.8392 fsync time: 2.2894
> fsync time: 2.4901 fsync time: 2.3059
> fsync time: 2.3151 fsync time: 2.3061
> fsync time: 2.3066 fsync time: 2.9825
> fsync time: 0.6608 fsync time: 2.3144
> fsync time: 0.0595 fsync time: 2.2894
> fsync time: 2.0977 fsync time: 0.0508
> fsync time: 2.3236 fsync time: 2.3396
> fsync time: 2.3229 fsync time: 2.3310
> fsync time: 2.3065 fsync time: 2.3061
> fsync time: 2.3234 fsync time: 2.3060
> fsync time: 2.3150 fsync time: 2.3561
> fsync time: 2.3149 fsync time: 2.3313
> fsync time: 2.3234 fsync time: 2.0221
> fsync time: 2.3066 fsync time: 2.2891
> fsync time: 2.3232 fsync time: 2.3144
> fsync time: 2.3317 fsync time: 2.3144
> fsync time: 2.3321 fsync time: 2.2894
> fsync time: 2.3232 fsync time: 2.3228
> fsync time: 0.0514 fsync time: 2.3144
> fsync time: 2.2480 fsync time: 0.0506
>
> Ext4
> ====
> ext4 (vanilla) ext4 (patched)
> fsync time: 3.4080 fsync time: 2.9109
> fsync time: 17.8330 fsync time: 25.0503
> fsync time: 0.0922 fsync time: 2.5495
> fsync time: 0.0710 fsync time: 0.0943
> fsync time: 19.7977 fsync time: 0.0770
> fsync time: 20.6592 fsync time: 16.3287
> fsync time: 0.1020 fsync time: 24.4983
> fsync time: 0.0689 fsync time: 0.1006
> fsync time: 19.9981 fsync time: 0.0783
> fsync time: 20.6605 fsync time: 19.1181
> fsync time: 0.0930 fsync time: 22.0860
> fsync time: 0.0776 fsync time: 0.0909
>
>
> Notes:
> ======
> - Above results are with and without Corrado's fsync issue patch. We
> happen to be discussing it in a different thread though, hence
> calling it out explicitly here.
>
> - I am running the Linus torture test and also running Ted Ts'o's fsync-tester
> to monitor fsync latencies.
>
> - Looks like ext3 fsync times have improved.
> - XFS fsync times have remained unchanged.
> - ext4 fsync times seem to have gone up a bit.
>
> I used default mount options, so I am assuming the high fsync times of ext4
> come from the fact that barriers must be enabled by default. Will do
> some blktracing on the ext4 case tomorrow; otherwise I think this patch
> looks good.

For the sake of completeness, I also ran the same tests on ext3 with
barriers enabled.

ext3 (barrier=1, 2.6.35-rc6)    ext3 (barrier=1, with fsync patch)
fsync time: 2.7601 fsync time: 1.5323
fsync time: 2.2352 fsync time: 1.5254
fsync time: 2.1689 fsync time: 1.4228
fsync time: 2.1666 fsync time: 1.8404
fsync time: 2.3017 fsync time: 5.6249
fsync time: 2.2256 fsync time: 1.6099
fsync time: 2.1588 fsync time: 1.5318
fsync time: 5.1648 fsync time: 2.0092
fsync time: 5.8390 fsync time: 1.9966
fsync time: 0.2109 fsync time: 2.0055
fsync time: 0.0906 fsync time: 2.0054
fsync time: 3.6327 fsync time: 0.1778
fsync time: 3.0161 fsync time: 0.0827
fsync time: 2.3194 fsync time: 2.3796
fsync time: 2.0581 fsync time: 1.5960
fsync time: 2.2850 fsync time: 1.5074
fsync time: 2.2002 fsync time: 1.8653
fsync time: 2.1932 fsync time: 1.8910
fsync time: 2.1753 fsync time: 1.9091
fsync time: 2.1669 fsync time: 1.8322
fsync time: 2.1671 fsync time: 1.8744
fsync time: 1.9552 fsync time: 1.8254
fsync time: 3.9870 fsync time: 1.8662
fsync time: 2.5140 fsync time: 1.8587
fsync time: 0.0867 fsync time: 1.7981

It is hard to say whether things improved with the patch or not. I guess
there is a slight improvement.

What is interesting, though, is that this fsync-tester test case works well
with ext3 and XFS, but with ext4 there seem to be large spikes in
fsync times.

[CCing Ted Tso]

Thanks
Vivek

2010-07-29 19:40:01

by Jeff Moyer

[permalink] [raw]
Subject: Re: cfq fsync patch testing results (Was: Re: [RFC PATCH] cfq-iosched: IOPS mode for group scheduling and new group_idle tunable)

Vivek Goyal <[email protected]> writes:

> On Thu, Jul 29, 2010 at 12:34:43AM -0400, Vivek Goyal wrote:
>> On Wed, Jul 28, 2010 at 07:57:16PM -0400, Christoph Hellwig wrote:
>> > On Wed, Jul 28, 2010 at 04:22:12PM -0400, Vivek Goyal wrote:
>> > > I also did "time firefox &" testing to see how long firefox takes to
>> > > launch when linus torture test is running and without patch it took
>> > > around 20 seconds and with patch it took around 17 seconds.
>> > >
>> > > So to me above test results suggest that this patch does not worsen
>> > > the performance. In fact it helps. (at least on ext3 file system.)
>> > >
>> > > Not sure why you are seeing different results with XFS.
>> >
>> > So why didn't you test it with XFS to verify his results?
>>
>> Just got a little lazy. Find the testing results with ext3, ext4 and
>> xfs below.
>>
>> > We all know
>> > that different filesystems have different I/O patterns, and we have
>> > a history of really nasty regressions in one filesystem caused by well-meaning
>> > changes to the I/O scheduler.
>> >
>> > ext3 in fact is a particularly bad test case as it not only doesn't have
>> > I/O barriers enabled, but also has particularly bad I/O patterns
>> > compared to modern filesystems.

A string of numbers is hard for me to parse. In the hopes that this
will help others, here is some awk-fu that I shamelessly stole from the
internets:

awk '{total1+=$3; total2+=$6; array1[NR]=$3; array2[NR]=$6} END{for(x=1;x<=NR;x++){sumsq1+=((array1[x]-(total1/NR))**2); sumsq2+=((array2[x]-(total2/NR))**2);}print total1/NR " " sqrt(sumsq1/NR) " " total2/NR " " sqrt(sumsq2/NR)}'
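
For example, with that awk program saved to a file (fsync-stats.awk is just a
hypothetical name) and one of the two-column listings above saved, quote
prefixes stripped, as results.txt, the invocation would be:

# each input line looks like "fsync time: 3.4173   fsync time: 0.0171";
# field $3 is the unpatched value, field $6 the patched one
awk -f fsync-stats.awk results.txt

The output is four numbers: mean and standard deviation of the first column,
then of the second, which is how the summaries below are laid out.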

>> Ext3 results
>> ============
>> ext3 (2.6.35-rc6) ext3 (35-rc6-fsync)
>> ----------------- -------------------
avg stddev avg stddev
3.8953 2.80654 0.587943 1.22399

>>
>> XFS results
>> ==========
>> XFS (2.6.35-rc6) XFS (with fsync patch)
2.2538 0.95565 2.11869 0.649704

>> Ext4
>> ====
>> ext4 (vanilla) ext4 (patched)
8.57177 9.54596 9.41524 10.4037

> ext3 (barrier=1, 2.6.35-rc6)  ext3 (barrier=1, with fsync patch)
2.40316 1.26992 1.82272 0.922305

It is interesting that ext4 does worse with the patch (though,
realistically, not by much).

Cheers,
Jeff