Problem: If a group isn't backlogged, we remove it from the service
tree. When it becomes active again, it gets the minimum vtime of the
tree. That is true even when the group was idle for only a very short
time and consumed some IO time right before it became idle. If the
group has a small weight, it can end up using more disk time than its
fair share.
Solution: We solve the problem by assigning the group its old vtime if
it has not been idle long enough. Otherwise we assign it the service
tree's min vtime.
Complications: When an entire service tree becomes completely idle, we
lose the vtime state; all the old vtime values are no longer
relevant. For example, consider the case where the service tree is idle
and a brand new group sends IO. That group would have an old vtime
value of zero, and the service tree's vtime would be close to zero as
well. In such a case, it would be unfair for the older groups to keep
the much higher old vtimes stored in them.
We solve that issue by keeping a generation number that counts how many
times the service tree has become completely empty. The generation
number is stored in each group as well. If a group becomes backlogged
after a service tree has been empty, we compare its stored generation
number with the current service tree generation number, and discard the
old vtime if the group's generation number is stale.
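In code terms, the add and del paths described above reduce to roughly
the following (a condensed restatement of the patch below, not
additional logic):

	/* cfq_group_service_tree_add(): pick a vdisktime for a group that
	 * just became backlogged again
	 */
	if (cfqd->active_generation > cfqg->generation_num)
		/* tree emptied since this group last ran: old vtime is stale */
		cfqg->vdisktime = st->min_vdisktime;
	else
		/* recently active: keep the old vtime, but never behind min */
		cfqg->vdisktime = max(st->min_vdisktime, cfqg->vdisktime);

	/* cfq_group_service_tree_del(): remember the current generation, and
	 * start a new one whenever the tree goes completely empty
	 */
	cfqg->generation_num = cfqd->active_generation;
	if (RB_EMPTY_ROOT(&st->rb))
		cfqd->active_generation++;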
The preemption-specific code is taken care of automatically, because we
allow preemption checks after inserting a group back into the service
tree and assigning it an appropriate vtime.
Signed-off-by: Chad Talbott <[email protected]>
---
block/cfq-iosched.c | 24 +++++++++++++-----------
1 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..f09f3fe 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -178,6 +178,7 @@ struct cfq_group {
 
 	/* group service_tree key */
 	u64 vdisktime;
+	u64 generation_num;
 	unsigned int weight;
 
 	/* number of cfqq currently on this group */
@@ -300,6 +301,9 @@ struct cfq_data {
 	/* List of cfq groups being managed on this device*/
 	struct hlist_head cfqg_list;
 	struct rcu_head rcu;
+
+	/* Generation number, counts service tree empty events */
+	u64 active_generation;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -873,18 +877,13 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		return;
 
-	/*
-	 * Currently put the group at the end. Later implement something
-	 * so that groups get lesser vtime based on their weights, so that
-	 * if group does not loose all if it was not continously backlogged.
-	 */
-	n = rb_last(&st->rb);
-	if (n) {
-		__cfqg = rb_entry_cfqg(n);
-		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
-	} else
+	if (cfqd->active_generation > cfqg->generation_num)
 		cfqg->vdisktime = st->min_vdisktime;
-
+	else
+		/* We assume that vdisktime was not modified when the task
+		   was off the service tree.
+		 */
+		cfqg->vdisktime = max(st->min_vdisktime, cfqg->vdisktime);
 	__cfq_group_service_tree_add(st, cfqg);
 	st->total_weight += cfqg->weight;
 }
@@ -906,6 +905,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
 	cfqg->saved_workload_slice = 0;
+	cfqg->generation_num = cfqd->active_generation;
+	if (RB_EMPTY_ROOT(&st->rb))
+		cfqd->active_generation++;
 	cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
 }
On Wed, Feb 09, 2011 at 05:32:11PM -0800, Chad Talbott wrote:
> Problem: If a group isn't backlogged, we remove it from the service
> tree. When it becomes active again, it gets the minimum vtime of the
> tree. That is true even when the group was idle for a very small time,
> and it consumed some IO time right before it became idle. If group
> has small weight, it can end up using more disk time than its fair
> share.
>
Hi Chad,
In the upstream code, once a group gets backlogged we put it at the end
and not at the beginning of the tree. (I am wondering whether you are
looking at the Google-internal code :-))
So I don't think the issue of a low weight group getting more disk
time than its fair share is present in upstream kernels.
cfq_group_service_tree_add() {
	/*
	 * Currently put the group at the end. Later implement something
	 * so that groups get lesser vtime based on their weights, so that
	 * if group does not loose all if it was not continously
	 * backlogged.
	 */
	n = rb_last(&st->rb);
	if (n) {
		__cfqg = rb_entry_cfqg(n);
		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
	} else
		cfqg->vdisktime = st->min_vdisktime;
}
Thanks
Vivek
> Solution: We solve the problem by assigning the group its old vtime if
> it has not been idle long enough. Otherise we assign it the service
> tree's min vtime.
>
> Complications: When an entire service tree becomes completely idle, we
> lose the vtime state. All the old vtime values are not relevant any
> more. For example, consider the case when the service tree is idle and
> a brand new group sends IO. That group would have an old vtime value
> of zero, but the service tree's vtime would become closer to zero. In
> such a case, it would be unfair for the older groups to get a much
> higher old vtime stored in them.
>
> We solve that issue by keeping a generation number that counts the
> number of instances when the service tree becomes completely
> empty. The generation number is stored in each group too. If a group
> becomes backlogged after a service tree has been empty, we compare its
> stored generation number with the current service tree generation
> number, and discard the old vtime if the group generation number is
> stale.
>
> The preemption specific code is taken care of automatically because we
> allow preemption checks after inserting a group back into the service
> tree and assigning it an appropriate vtime.
>
> Signed-off-by: Chad Talbott <[email protected]>
> ---
> block/cfq-iosched.c | 24 +++++++++++++-----------
> 1 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 501ffdf..f09f3fe 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -178,6 +178,7 @@ struct cfq_group {
>
> /* group service_tree key */
> u64 vdisktime;
> + u64 generation_num;
> unsigned int weight;
>
> /* number of cfqq currently on this group */
> @@ -300,6 +301,9 @@ struct cfq_data {
> /* List of cfq groups being managed on this device*/
> struct hlist_head cfqg_list;
> struct rcu_head rcu;
> +
> + /* Generation number, counts service tree empty events */
> + u64 active_generation;
> };
>
> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> @@ -873,18 +877,13 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> if (!RB_EMPTY_NODE(&cfqg->rb_node))
> return;
>
> - /*
> - * Currently put the group at the end. Later implement something
> - * so that groups get lesser vtime based on their weights, so that
> - * if group does not loose all if it was not continously backlogged.
> - */
> - n = rb_last(&st->rb);
> - if (n) {
> - __cfqg = rb_entry_cfqg(n);
> - cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> - } else
> + if (cfqd->active_generation > cfqg->generation_num)
> cfqg->vdisktime = st->min_vdisktime;
> -
> + else
> + /* We assume that vdisktime was not modified when the task
> + was off the service tree.
> + */
> + cfqg->vdisktime = max(st->min_vdisktime, cfqg->vdisktime);
> __cfq_group_service_tree_add(st, cfqg);
> st->total_weight += cfqg->weight;
> }
> @@ -906,6 +905,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> if (!RB_EMPTY_NODE(&cfqg->rb_node))
> cfq_rb_erase(&cfqg->rb_node, st);
> cfqg->saved_workload_slice = 0;
> + cfqg->generation_num = cfqd->active_generation;
> + if (RB_EMPTY_ROOT(&st->rb))
> + cfqd->active_generation++;
> cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
> }
>
On Wed, Feb 9, 2011 at 6:09 PM, Vivek Goyal <[email protected]> wrote:
> In upstream code once a group gets backlogged we put it at the end
> and not at the beginning of the tree. (I am wondering are you looking
> at the google internal code :-))
>
> So I don't think that issue of a low weight group getting more disk
> time than its fair share is present in upstream kernels.
You've caught me re-using a commit description. :)
Here's an example of the kind of tests that fail without this patch
(run via the test that Justin and Akshay have posted):
15:35:35 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
15:35:55 INFO Experiment completed in 20.4 seconds
15:35:55 INFO experiment 14 achieved DTFs: 886, 113
15:35:55 INFO experiment 14 FAILED: max observed error is 64, allowed is 50
15:35:55 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
15:36:16 INFO Experiment completed in 20.5 seconds
15:36:16 INFO experiment 15 achieved DTFs: 891, 108
15:36:16 INFO experiment 15 FAILED: max observed error is 59, allowed is 50
Since this is Jens' unmodified tree, I've had to change
BLKIO_WEIGHT_MIN to 10 to allow this test to proceed. We typically
run many jobs with small weights, and achieve the requested isolation;
see the results below with this patch:
14:59:17 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
14:59:36 INFO Experiment completed in 19.0 seconds
14:59:36 INFO experiment 14 achieved DTFs: 947, 52
14:59:36 INFO experiment 14 PASSED: max observed error is 3, allowed is 50
14:59:36 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
14:59:55 INFO Experiment completed in 18.5 seconds
14:59:55 INFO experiment 15 achieved DTFs: 944, 55
14:59:55 INFO experiment 15 PASSED: max observed error is 6, allowed is 50
As you can see, it's for seeky workloads that come and go from the
service tree that this patch is required.
Chad
On Wed, Feb 09, 2011 at 06:45:25PM -0800, Chad Talbott wrote:
> On Wed, Feb 9, 2011 at 6:09 PM, Vivek Goyal <[email protected]> wrote:
> > In upstream code once a group gets backlogged we put it at the end
> > and not at the beginning of the tree. (I am wondering are you looking
> > at the google internal code :-))
> >
> > So I don't think that issue of a low weight group getting more disk
> > time than its fair share is present in upstream kernels.
>
> You've caught me re-using a commit description. :)
>
> Here's an example of the kind of tests that fail without this patch
> (run via the test that Justin and Akshay have posted):
>
> 15:35:35 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
> 15:35:55 INFO Experiment completed in 20.4 seconds
> 15:35:55 INFO experiment 14 achieved DTFs: 886, 113
> 15:35:55 INFO experiment 14 FAILED: max observed error is 64, allowed is 50
>
> 15:35:55 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
> 15:36:16 INFO Experiment completed in 20.5 seconds
> 15:36:16 INFO experiment 15 achieved DTFs: 891, 108
> 15:36:16 INFO experiment 15 FAILED: max observed error is 59, allowed is 50
>
> Since this is Jens' unmodified tree, I've had to change
> BLKIO_WEIGHT_MIN to 10 to allow this test to proceed. We typically
> run many jobs with small weights, and achieve the requested isolation:
> see below results with this patch:
>
> 14:59:17 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
> 14:59:36 INFO Experiment completed in 19.0 seconds
> 14:59:36 INFO experiment 14 achieved DTFs: 947, 52
> 14:59:36 INFO experiment 14 PASSED: max observed error is 3, allowed is 50
>
> 14:59:36 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
> 14:59:55 INFO Experiment completed in 18.5 seconds
> 14:59:55 INFO experiment 15 achieved DTFs: 944, 55
> 14:59:55 INFO experiment 15 PASSED: max observed error is 6, allowed is 50
>
> As you can see, it's with seeky workloads that come and go from the
> service tree where this patch is required.
I have not looked into or run the tests posted by Justin and Akshay. Can you
give more details about these tests?
Are you running with group_isolation=0 or 1? These tests seem to be random
reads, and if group_isolation=0 (the default), then all the random read queues
should go into the root group and there will be no service differentiation.
If you ran different random readers in different groups of different
weight with group_isolation=1, then there is a case for service
differentiation. In that case we will idle for 8ms on each group before
we expire the group. So in these test cases, are the low weight groups not
submitting IO within 8ms? Putting a random reader in a separate group
with a think time > 8ms is, I think, going to hurt a lot, because for every
single IO dispatched the group is going to wait for 8ms before it is
expired.
So the only case that comes to my mind where this patch can help is
when there are lots of groups doing IO with different weights. These
groups have a think time greater than 8ms and hence get deleted from the
service tree. The next time a low weight group has IO, instead of being
put at the end of the service tree, it might be put even farther back,
allowing a higher weight group to get backlogged ahead of it.
Can you run blktrace and verify what's happening?
Thanks
Vivek
On Wed, Feb 09, 2011 at 05:32:11PM -0800, Chad Talbott wrote:
> Problem: If a group isn't backlogged, we remove it from the service
> tree. When it becomes active again, it gets the minimum vtime of the
> tree. That is true even when the group was idle for a very small time,
> and it consumed some IO time right before it became idle. If group
> has small weight, it can end up using more disk time than its fair
> share.
>
> Solution: We solve the problem by assigning the group its old vtime if
> it has not been idle long enough. Otherise we assign it the service
> tree's min vtime.
>
> Complications: When an entire service tree becomes completely idle, we
> lose the vtime state. All the old vtime values are not relevant any
> more. For example, consider the case when the service tree is idle and
> a brand new group sends IO. That group would have an old vtime value
> of zero, but the service tree's vtime would become closer to zero. In
> such a case, it would be unfair for the older groups to get a much
> higher old vtime stored in them.
>
> We solve that issue by keeping a generation number that counts the
> number of instances when the service tree becomes completely
> empty. The generation number is stored in each group too. If a group
> becomes backlogged after a service tree has been empty, we compare its
> stored generation number with the current service tree generation
> number, and discard the old vtime if the group generation number is
> stale.
>
> The preemption specific code is taken care of automatically because we
> allow preemption checks after inserting a group back into the service
> tree and assigning it an appropriate vtime.
>
> Signed-off-by: Chad Talbott <[email protected]>
> ---
> block/cfq-iosched.c | 24 +++++++++++++-----------
> 1 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 501ffdf..f09f3fe 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -178,6 +178,7 @@ struct cfq_group {
>
> /* group service_tree key */
> u64 vdisktime;
> + u64 generation_num;
> unsigned int weight;
>
> /* number of cfqq currently on this group */
> @@ -300,6 +301,9 @@ struct cfq_data {
> /* List of cfq groups being managed on this device*/
> struct hlist_head cfqg_list;
> struct rcu_head rcu;
> +
> + /* Generation number, counts service tree empty events */
> + u64 active_generation;
> };
>
> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> @@ -873,18 +877,13 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> if (!RB_EMPTY_NODE(&cfqg->rb_node))
> return;
>
> - /*
> - * Currently put the group at the end. Later implement something
> - * so that groups get lesser vtime based on their weights, so that
> - * if group does not loose all if it was not continously backlogged.
> - */
> - n = rb_last(&st->rb);
> - if (n) {
> - __cfqg = rb_entry_cfqg(n);
> - cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> - } else
> + if (cfqd->active_generation > cfqg->generation_num)
> cfqg->vdisktime = st->min_vdisktime;
- What happens when cfqd->active_generation wraps around? Should we use
  functions which take care of wrapping?
- So when the generation number changes you want to put the newly backlogged
  group at the front of the tree and not at the end of it? That kind
  of makes sense, as any continuously backlogged groups will be on the
  service tree for a long time, while newly backlogged groups are either
  new IO starting up or some application with a high think time which does
  IO once in a while and does not keep the disk occupied for very long. In
  such cases it is probably not a bad idea to put newly backlogged
  groups at the beginning of the tree.
> -
> + else
> + /* We assume that vdisktime was not modified when the task
> + was off the service tree.
> + */
> + cfqg->vdisktime = max(st->min_vdisktime, cfqg->vdisktime);
> __cfq_group_service_tree_add(st, cfqg);
> st->total_weight += cfqg->weight;
> }
> @@ -906,6 +905,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> if (!RB_EMPTY_NODE(&cfqg->rb_node))
> cfq_rb_erase(&cfqg->rb_node, st);
> cfqg->saved_workload_slice = 0;
> + cfqg->generation_num = cfqd->active_generation;
> + if (RB_EMPTY_ROOT(&st->rb))
> + cfqd->active_generation++;
I don't understand the significance of changing the generation
number when the tree is idle. The tree being idle does not mean that a
few recently deleted groups will not get backlogged again soon, does it?
Thanks
Vivek
On Wed, Feb 9, 2011 at 7:57 PM, Vivek Goyal <[email protected]> wrote:
> On Wed, Feb 09, 2011 at 06:45:25PM -0800, Chad Talbott wrote:
>> On Wed, Feb 9, 2011 at 6:09 PM, Vivek Goyal <[email protected]> wrote:
>> > In upstream code once a group gets backlogged we put it at the end
>> > and not at the beginning of the tree. (I am wondering are you looking
>> > at the google internal code :-))
>> >
>> > So I don't think that issue of a low weight group getting more disk
>> > time than its fair share is present in upstream kernels.
>>
> You've caught me re-using a commit description. :)
>>
>> Here's an example of the kind of tests that fail without this patch
>> (run via the test that Justin and Akshay have posted):
>>
>> 15:35:35 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
>> 15:35:55 INFO Experiment completed in 20.4 seconds
>> 15:35:55 INFO experiment 14 achieved DTFs: 886, 113
>> 15:35:55 INFO experiment 14 FAILED: max observed error is 64, allowed is 50
>>
>> 15:35:55 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
>> 15:36:16 INFO Experiment completed in 20.5 seconds
>> 15:36:16 INFO experiment 15 achieved DTFs: 891, 108
>> 15:36:16 INFO experiment 15 FAILED: max observed error is 59, allowed is 50
>>
>> Since this is Jens' unmodified tree, I've had to change
>> BLKIO_WEIGHT_MIN to 10 to allow this test to proceed. We typically
>> run many jobs with small weights, and achieve the requested isolation:
>> see below results with this patch:
>>
>> 14:59:17 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
>> 14:59:36 INFO Experiment completed in 19.0 seconds
>> 14:59:36 INFO experiment 14 achieved DTFs: 947, 52
>> 14:59:36 INFO experiment 14 PASSED: max observed error is 3, allowed is 50
>>
>> 14:59:36 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
>> 14:59:55 INFO Experiment completed in 18.5 seconds
>> 14:59:55 INFO experiment 15 achieved DTFs: 944, 55
>> 14:59:55 INFO experiment 15 PASSED: max observed error is 6, allowed is 50
>>
>> As you can see, it's with seeky workloads that come and go from the
>> service tree where this patch is required.
>
> I have not look into or run the tests posted by Justin and Akshay. Can you
> give more details about these tests.
> Are you running with group_isolation=0 or 1. These tests seem to be random
> read and if group_isolation=0 (default), then all the random read queues
> should go in root group and there will be no service differentiation.
The test sets group_isolation=1 as part of its setup, as this is our
standard configuration.
> If you ran different random readers in different groups of differnet
> weight with group_isolation=1, then there is a case of having service
> differentiation. In that case we will idle for 8ms on each group before
> we expire the group. So in these test cases are low weight groups not
> submitting IO with-in 8ms? Putting a random reader in separate group
> with think time > 8, I think is going to hurt a lot because for every
> single IO dispatched group is going to weight for 8ms before it is
> expired.
You're right about the behavior of group_idle. We have more
experience with earlier kernels (before group_idle). With this patch
we are able to achieve isolation without group_idle even with these
large ratios. (Without group_idle the random reader workloads get
marked seeky and idling is disabled, so we have to remember the
vdisktime to get isolation.)
> Can you run blktrace and verify what's happenig?
I can run a blktrace, and I think it will show what you expect.
Chad
On Wed, Feb 9, 2011 at 8:02 PM, Vivek Goyal <[email protected]> wrote:
> - What happens when cfqd->active_generation wraps over? Should we use
>   functions which take care of wrapping.
To be absolutely correct, you're right. But even if the service tree
becomes completely idle every microsecond, we still have about half a
million years before the first wrap (2^64 microseconds is roughly
584,000 years).
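If wrapping ever did become a concern, a wrap-safe comparison in the
spirit of the kernel's time_after() helpers could replace the plain '>'
check. A minimal sketch (not part of the posted patch; the helper name
is made up):

	/*
	 * Sketch only: wrap-safe "generation a is newer than generation b"
	 * test, valid as long as the two values are less than 2^63 apart.
	 */
	static inline bool cfq_generation_after(u64 a, u64 b)
	{
		return (s64)(a - b) > 0;
	}

The add-path check would then read
cfq_generation_after(cfqd->active_generation, cfqg->generation_num)
instead of a direct comparison.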
> - So when generation number changes you want to put the newly backlogged
>   group at the front of tree and not at the end of it? Though it kind
>   of make sense as any continuously backlogged groups will be on service
>   tree for long time and newly backlogged groups are either new IO
>   starting or some application which high think time and which does IO
>   one in a while and does not keep disk occupied for very long time. In
>   such cases it probably is not a bad idea to put newly backlogged
>   groups at the beginning of the tree.
Yes, that's it exactly.
>> @@ -906,6 +905,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
>> 		cfq_rb_erase(&cfqg->rb_node, st);
>> 	cfqg->saved_workload_slice = 0;
>> +	cfqg->generation_num = cfqd->active_generation;
>> +	if (RB_EMPTY_ROOT(&st->rb))
>> +		cfqd->active_generation++;
>
> I don't understand that what is the significance behind chaning generation
> number when tree is idle? When tree is idle does not mean that few
> recently deleted groups will not get backlogged soon?
It's a result of being both fair and work-conserving. If *all* the
queues are without requests, then whoever has requests next should get
to use the disk fully, to be work-conserving. But if the disk is
always kept busy (by anyone), then there is competition and we have to
provide fairness.
Chad
On Thu, Feb 10, 2011 at 10:57 AM, Chad Talbott <[email protected]> wrote:
> On Wed, Feb 9, 2011 at 7:57 PM, Vivek Goyal <[email protected]> wrote:
>> If you ran different random readers in different groups of differnet
>> weight with group_isolation=1, then there is a case of having service
>> differentiation. In that case we will idle for 8ms on each group before
>> we expire the group. So in these test cases are low weight groups not
>> submitting IO with-in 8ms? Putting a random reader in separate group
>> with think time > 8, I think is going to hurt a lot because for every
>> single IO dispatched group is going to weight for 8ms before it is
>> expired.
>
>> You're right about the behavior of group_idle. We have more
>> experience with earlier kernels (before group_idle). With this patch
>> we are able to achieve isolation without group_idle even with these
>> large ratios. (Without group_idle the random reader workloads will
>> get marked seeky, and idling is disabled. Without group_idle, we have
> to remember the vdisktime to get isolation.)
>
>> Can you run blktrace and verify what's happenig?
>
> I can run a blktrace, and I think it will show what you expect.
So, I ran the following two tests and took a blktrace.
950 rdrand, 50 rdrand.delay10
weight 950 random reader with low think time vs weight 50 random
reader with 10ms think time
950 rdrand, 50 rdrand.delay50 # 50ms think time
weight 950 random reader with low think time vs weight 50 random
reader with 50ms think time
I find that we are still idling for these random readers, even the one
with 50ms think time. group_idle is 0 according to blktrace.
With this patch, both of these cases have correct isolation. Without
this patch, the small weight reader is able to get more than its
share.
I think that idling for a random reader with a 50ms think time is
likely a bug, but a separate issue.
Chad
On Thu, Feb 10, 2011 at 04:36:25PM -0800, Chad Talbott wrote:
> On Thu, Feb 10, 2011 at 10:57 AM, Chad Talbott <[email protected]> wrote:
> > On Wed, Feb 9, 2011 at 7:57 PM, Vivek Goyal <[email protected]> wrote:
> >> If you ran different random readers in different groups of differnet
> >> weight with group_isolation=1, then there is a case of having service
> >> differentiation. In that case we will idle for 8ms on each group before
> >> we expire the group. So in these test cases are low weight groups not
> >> submitting IO with-in 8ms? Putting a random reader in separate group
> >> with think time > 8, I think is going to hurt a lot because for every
> >> single IO dispatched group is going to weight for 8ms before it is
> >> expired.
> >
> > You're right about the behavior of group_idle. We have more
> > experience with earlier kernels (before group_idle). With this patch
> > we are able to achieve isolation without group_idle even with these
> > large ratios. (Without group_idle the random reader workloads will
> > get marked seeky, and idling is disabled. Without group_idle, we have
> > to remember the vdisktime to get isolation.)
> >
> >> Can you run blktrace and verify what's happenig?
> >
> > I can run a blktrace, and I think it will show what you expect.
>
> So, I ran the following two tests and took a blktrace.
>
> 950 rdrand, 50 rdrand.delay10
> weight 950 random reader with low think time vs weight 50 random
> reader with 10ms think time
>
> 950 rdrand, 50 rdrand.delay50 # 50ms think time
> weight 950 random reader with low think time vs weight 50 random
> reader with 50ms think time
>
> I find that we are still idling for these random readers, even the one
> with 50ms think time. group_idle is 0 according to blktrace.
>
> With this patch, both of these cases have correct isolation. Without
> this patch, the small weight reader is able to get more than its
> share.
>
> I think that idling for a random reader with a 50ms think time is
> likely a bug, but a separate issue.
Thanks for checking this out. I agree that for a low weight random
reader/writer with a high think time, we need to remember the vdisktime;
otherwise it will show up as a fresh new candidate and get more done.
Having said that, one can argue that a random reader/writer doing a small
amount of IO should be able to get its job done really fast, and the ones
hogging the disk for a long time should get a higher vdisktime.
So with this scheme, a random reader/writer will have to be of higher
weight to get the job done fast. A low weight reader/writer will still
get a higher vdisktime and a smaller share. I think that is reasonable.
And yes, if we are idling on a 50ms think-time random reader even with
group_idle=0, that sounds like a bug.
Thanks
Vivek
On Thu, Feb 10, 2011 at 11:06:37AM -0800, Chad Talbott wrote:
> On Wed, Feb 9, 2011 at 8:02 PM, Vivek Goyal <[email protected]> wrote:
> > - What happens when cfqd->active_generation wraps over? Should we use
> >   functions which take care of wrapping.
>
> To be absolutely correct, you're right. But even if the service tree
> becomes completely idle every microsecond, we still have 1/2 million
> years before the first wrap.
Ok, so I guess it is fine for the time being, unless and until we
encounter a really fast device.
>
> > - So when generation number changes you want to put the newly backlogged
> >   group at the front of tree and not at the end of it? Though it kind
> >   of make sense as any continuously backlogged groups will be on service
> >   tree for long time and newly backlogged groups are either new IO
> >   starting or some application which high think time and which does IO
> >   one in a while and does not keep disk occupied for very long time. In
> >   such cases it probably is not a bad idea to put newly backlogged
> >   groups at the beginning of the tree.
>
> Yes, that's it exactly.
>
> >> @@ -906,6 +905,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
> >> 		cfq_rb_erase(&cfqg->rb_node, st);
> >> 	cfqg->saved_workload_slice = 0;
> >> +	cfqg->generation_num = cfqd->active_generation;
> >> +	if (RB_EMPTY_ROOT(&st->rb))
> >> +		cfqd->active_generation++;
> >
> > I don't understand that what is the significance behind chaning generation
> > number when tree is idle? When tree is idle does not mean that few
> > recently deleted groups will not get backlogged soon?
>
> It's a result of being both fair and work conserving. If *all* the
> queues are without requests, then whoever next has requests should get
> to use the disk fully to be work-conserving. But if the disk is
> always kept busy (by anyone), then there is competition and we have to
> provide fairness.
Well, the service tree being idle does not mean that nobody else is using
the disk. We could have the case where the service tree is idle in between,
while all the readers are in their think-time phase. In that case I think
one could further refine the logic to change the generation only if the
tree stays idle for more than slice_idle.
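A rough sketch of that refinement (purely hypothetical, not part of the
posted patch; the last_empty field is an assumed addition to cfq_data):

	/* in cfq_group_service_tree_del(), when the last group leaves */
	if (RB_EMPTY_ROOT(&st->rb))
		cfqd->last_empty = jiffies;	/* hypothetical new field */

	/* in cfq_group_service_tree_add(), when a group joins an empty
	 * tree: only start a new generation if the tree stayed empty for
	 * longer than slice_idle
	 */
	if (RB_EMPTY_ROOT(&st->rb) &&
	    time_after(jiffies, cfqd->last_empty + cfqd->cfq_slice_idle))
		cfqd->active_generation++;

This trades one extra timestamp in cfq_data for fewer spurious
generation bumps during short think-time gaps.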
But I think we can do that later if we run into some issues. For the
time being I am fine with it, as it has been working for you guys.
Can you please update the patch to correct the commit description and
repost?
Thanks
Vivek
On Fri, Feb 11, 2011 at 01:15:33PM -0500, Vivek Goyal wrote:
> On Thu, Feb 10, 2011 at 04:36:25PM -0800, Chad Talbott wrote:
> > On Thu, Feb 10, 2011 at 10:57 AM, Chad Talbott <[email protected]> wrote:
> > > On Wed, Feb 9, 2011 at 7:57 PM, Vivek Goyal <[email protected]> wrote:
> > >> If you ran different random readers in different groups of differnet
> > >> weight with group_isolation=1, then there is a case of having service
> > >> differentiation. In that case we will idle for 8ms on each group before
> > >> we expire the group. So in these test cases are low weight groups not
> > >> submitting IO with-in 8ms? Putting a random reader in separate group
> > >> with think time > 8, I think is going to hurt a lot because for every
> > >> single IO dispatched group is going to weight for 8ms before it is
> > >> expired.
> > >
> > > You're right about the behavior of group_idle. We have more
> > > experience with earlier kernels (before group_idle). With this patch
> > > we are able to achieve isolation without group_idle even with these
> > > large ratios. (Without group_idle the random reader workloads will
> > > get marked seeky, and idling is disabled. Without group_idle, we have
> > > to remember the vdisktime to get isolation.)
> > >
> > >> Can you run blktrace and verify what's happenig?
> > >
> > > I can run a blktrace, and I think it will show what you expect.
> >
> > So, I ran the following two tests and took a blktrace.
> >
> > 950 rdrand, 50 rdrand.delay10
> > weight 950 random reader with low think time vs weight 50 random
> > reader with 10ms think time
> >
> > 950 rdrand, 50 rdrand.delay50 # 50ms think time
> > weight 950 random reader with low think time vs weight 50 random
> > reader with 50ms think time
> >
> > I find that we are still idling for these random readers, even the one
> > with 50ms think time. group_idle is 0 according to blktrace.
> >
> > With this patch, both of these cases have correct isolation. Without
> > this patch, the small weight reader is able to get more than its
> > share.
> >
> > I think that idling for a random reader with a 50ms think time is
> > likely a bug, but a separate issue.
>
> Thanks for checking this out. I agree that for a low weight random
> reader/writer which high think time, we need to remember the vdisktime
> otherwise it will showup as a fresh new candidate and get more done.
>
> Having said that, one can say that random reader/writer doing small
> amount of IO should be able to get job done really fast and the one
> who are hogging the disk for long time, should get higher vdisktime.
>
> So with this scheme, a random reader/writer shall have to be of higher
> weight to get the job done fast. A low weight reader/writer will still
> get higher vdisktime and get lesser share. I think it is reasonable.
>
> And yes, even with group_idle=0 if we are idling on a 50ms thinktime
> random reader it sounds like a bug.
Thinking more about it, I think it must be happening because random IO
goes on the sync-noidle tree of the group, and there we idle on the
whole tree. I think if you set slice_idle=0 along with group_idle=0,
that idling should go away.
Thanks
Vivek