LinuxLists.cc - [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

2009-12-11 05:06:30

Subject: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Currently, with IO Controller introduced, CFQ chooses cfq group
at the top, and then choose service tree. So we need to take
whether cfq group is changed into account to decide whether we
should choose service tree start from scratch.

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 27 ++++++++++++++++++---------
1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f3f6239..16084ca 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1964,7 +1964,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)

static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio,
- bool prio_changed)
+ bool cfqg_changed, bool prio_changed)
{
struct cfq_queue *queue;
int i;
@@ -1972,7 +1972,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
unsigned long lowest_key = 0;
enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;

- if (prio_changed) {
+ if (cfqg_changed || prio_changed) {
/*
* When priorities switched, we prefer starting
* from SYNC_NOIDLE (first choice), or just SYNC
@@ -2001,7 +2001,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
return cur_best;
}

-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void choose_service_tree(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, bool cfqg_changed)
{
enum wl_prio_t previous_prio = cfqd->serving_prio;
bool prio_changed;
@@ -2033,21 +2034,22 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
* expiration time
*/
prio_changed = (cfqd->serving_prio != previous_prio);
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
- cfqd);
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
+ cfqd);
count = st->count;

/*
* If priority didn't change, check workload expiration,
* and that we still have other queues ready
*/
- if (!prio_changed && count &&
+ if (!cfqg_changed && !prio_changed && count &&
!time_after(jiffies, cfqd->workload_expires))
return;

/* otherwise select new workload type */
cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio,
+ cfqg_changed, prio_changed);
st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
cfqd);
count = st->count;
@@ -2104,9 +2106,16 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)

static void cfq_choose_cfqg(struct cfq_data *cfqd)
{
+ bool cfqg_changed = false;
+
+ struct cfq_group *orig_cfqg = cfqd->serving_group;
+
struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);

- cfqd->serving_group = cfqg;
+ if (orig_cfqg != cfqg) {
+ cfqg_changed = 1;
+ cfqd->serving_group = cfqg;
+ }

/* Restore the workload type data */
if (cfqg->saved_workload_slice) {
@@ -2114,7 +2123,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
cfqd->serving_type = cfqg->saved_workload;
cfqd->serving_prio = cfqg->saved_serving_prio;
}
- choose_service_tree(cfqd, cfqg);
+ choose_service_tree(cfqd, cfqg, cfqg_changed);
}

/*
--
1.5.4.rc3

2009-12-11 15:07:30

by Vivek Goyal

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
> Currently, with IO Controller introduced, CFQ chooses cfq group
> at the top, and then choose service tree. So we need to take
> whether cfq group is changed into account to decide whether we
> should choose service tree start from scratch.
>

I am not able to understand the need/purpose of this patch. Once we
switched the group during scheduling, why should we reset the order
of workload with-in group. In fact we don't want to do that and we
want to start from where we left so that no workload with-in group is
starved.

When a group is resumed, choose_service_tree() always checks if any RT
queue got backlogged or not. If yes, it does choose that first.

Can you give more details why do you want to do this change?

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <[email protected]>
> ---
> block/cfq-iosched.c | 27 ++++++++++++++++++---------
> 1 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f3f6239..16084ca 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1964,7 +1964,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>
> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> struct cfq_group *cfqg, enum wl_prio_t prio,
> - bool prio_changed)
> + bool cfqg_changed, bool prio_changed)
> {
> struct cfq_queue *queue;
> int i;
> @@ -1972,7 +1972,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> unsigned long lowest_key = 0;
> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>
> - if (prio_changed) {
> + if (cfqg_changed || prio_changed) {
> /*
> * When priorities switched, we prefer starting
> * from SYNC_NOIDLE (first choice), or just SYNC
> @@ -2001,7 +2001,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> return cur_best;
> }
>
> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> +static void choose_service_tree(struct cfq_data *cfqd,
> + struct cfq_group *cfqg, bool cfqg_changed)
> {
> enum wl_prio_t previous_prio = cfqd->serving_prio;
> bool prio_changed;
> @@ -2033,21 +2034,22 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> * expiration time
> */
> prio_changed = (cfqd->serving_prio != previous_prio);
> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> - cfqd);
> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> + cfqd);
> count = st->count;
>
> /*
> * If priority didn't change, check workload expiration,
> * and that we still have other queues ready
> */
> - if (!prio_changed && count &&
> + if (!cfqg_changed && !prio_changed && count &&
> !time_after(jiffies, cfqd->workload_expires))
> return;
>
> /* otherwise select new workload type */
> cfqd->serving_type =
> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio,
> + cfqg_changed, prio_changed);
> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> cfqd);
> count = st->count;
> @@ -2104,9 +2106,16 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>
> static void cfq_choose_cfqg(struct cfq_data *cfqd)
> {
> + bool cfqg_changed = false;
> +
> + struct cfq_group *orig_cfqg = cfqd->serving_group;
> +
> struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
>
> - cfqd->serving_group = cfqg;
> + if (orig_cfqg != cfqg) {
> + cfqg_changed = 1;
> + cfqd->serving_group = cfqg;
> + }
>
> /* Restore the workload type data */
> if (cfqg->saved_workload_slice) {
> @@ -2114,7 +2123,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> cfqd->serving_type = cfqg->saved_workload;
> cfqd->serving_prio = cfqg->saved_serving_prio;
> }
> - choose_service_tree(cfqd, cfqg);
> + choose_service_tree(cfqd, cfqg, cfqg_changed);
> }
>
> /*
> --
> 1.5.4.rc3
>

2009-12-11 18:01:10

by Corrado Zoccolo

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Hi guys,
On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
>> Currently, with IO Controller introduced, CFQ chooses cfq group
>> at the top, and then choose service tree. So we need to take
>> whether cfq group is changed into account to decide whether we
>> should choose service tree start from scratch.
>>
>
> I am not able to understand the need/purpose of this patch. Once we
> switched the group during scheduling, why should we reset the order
> of workload with-in group.

I understand it, and in fact I was thinking about this.
The idea is the same as with priorities. If we have not serviced a group
for a while, we want to start with the no-idle workload to reduce its latency.

Unfortunately, a group may have a too small share, that could cause some
workloads to be starved, as Vivek correctly points out, and we should
avoid this.
It should be easily reproducible testing a "many groups with mixed workloads"
scenario with group_isolation=1.

Moreover, even if the approach is groups on top, when group isolation
is not set (i.e. the default), in the non-root groups you will only
have the sync-idle
queues, so it is much more similar (logically) to a workload on top
than it appears
from the code.

I think the net result of this patch is, when group isolation is not
set, to select no-idle
workload first only when entering the root group, thus a slight
penalization of the
async workload.
Gui, if you observed improvements with this patch, probably you can obtain them
without the starvation drawback by making it conditional to !group_isolation.

BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
can have just one argument, called e.g. boost_no_idle, and you pass
(!group_isolation && cfqg_changed) || prio_changed.

> In fact we don't want to do that and we
> want to start from where we left so that no workload with-in group is
> starved.

Agreed.

Thanks,
Corrado

>
> When a group is resumed, choose_service_tree() always checks if any RT
> queue got backlogged or not. If yes, it does choose that first.
>
> Can you give more details why do you want to do this change?
>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <[email protected]>
>> ---
>> block/cfq-iosched.c | 27 ++++++++++++++++++---------
>> 1 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index f3f6239..16084ca 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1964,7 +1964,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>>
>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio,
>> - bool prio_changed)
>> + bool cfqg_changed, bool prio_changed)
>> {
>> struct cfq_queue *queue;
>> int i;
>> @@ -1972,7 +1972,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> unsigned long lowest_key = 0;
>> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>>
>> - if (prio_changed) {
>> + if (cfqg_changed || prio_changed) {
>> /*
>> * When priorities switched, we prefer starting
>> * from SYNC_NOIDLE (first choice), or just SYNC
>> @@ -2001,7 +2001,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> return cur_best;
>> }
>>
>> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> +static void choose_service_tree(struct cfq_data *cfqd,
>> + struct cfq_group *cfqg, bool cfqg_changed)
>> {
>> enum wl_prio_t previous_prio = cfqd->serving_prio;
>> bool prio_changed;
>> @@ -2033,21 +2034,22 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> * expiration time
>> */
>> prio_changed = (cfqd->serving_prio != previous_prio);
>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>> - cfqd);
>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>> + cfqd);
>> count = st->count;
>>
>> /*
>> * If priority didn't change, check workload expiration,
>> * and that we still have other queues ready
>> */
>> - if (!prio_changed && count &&
>> + if (!cfqg_changed && !prio_changed && count &&
>> !time_after(jiffies, cfqd->workload_expires))
>> return;
>>
>> /* otherwise select new workload type */
>> cfqd->serving_type =
>> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
>> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio,
>> + cfqg_changed, prio_changed);
>> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>> cfqd);
>> count = st->count;
>> @@ -2104,9 +2106,16 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>>
>> static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> {
>> + bool cfqg_changed = false;
>> +
>> + struct cfq_group *orig_cfqg = cfqd->serving_group;
>> +
>> struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
>>
>> - cfqd->serving_group = cfqg;
>> + if (orig_cfqg != cfqg) {
>> + cfqg_changed = 1;
>> + cfqd->serving_group = cfqg;
>> + }
>>
>> /* Restore the workload type data */
>> if (cfqg->saved_workload_slice) {
>> @@ -2114,7 +2123,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> cfqd->serving_type = cfqg->saved_workload;
>> cfqd->serving_prio = cfqg->saved_serving_prio;
>> }
>> - choose_service_tree(cfqd, cfqg);
>> + choose_service_tree(cfqd, cfqg, cfqg_changed);
>> }
>>
>> /*
>> --
>> 1.5.4.rc3
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-12-11 18:46:34

by Vivek Goyal

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

On Fri, Dec 11, 2009 at 07:01:14PM +0100, Corrado Zoccolo wrote:
> Hi guys,
> On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
> >> Currently, with IO Controller introduced, CFQ chooses cfq group
> >> at the top, and then choose service tree. So we need to take
> >> whether cfq group is changed into account to decide whether we
> >> should choose service tree start from scratch.
> >>
> >
> > I am not able to understand the need/purpose of this patch. Once we
> > switched the group during scheduling, why should we reset the order
> > of workload with-in group.
>
> I understand it, and in fact I was thinking about this.
> The idea is the same as with priorities. If we have not serviced a group
> for a while, we want to start with the no-idle workload to reduce its latency.
>
> Unfortunately, a group may have a too small share, that could cause some
> workloads to be starved, as Vivek correctly points out, and we should
> avoid this.
> It should be easily reproducible testing a "many groups with mixed workloads"
> scenario with group_isolation=1.
>
> Moreover, even if the approach is groups on top, when group isolation
> is not set (i.e. the default), in the non-root groups you will only
> have the sync-idle
> queues, so it is much more similar (logically) to a workload on top
> than it appears
> from the code.
>
> I think the net result of this patch is, when group isolation is not
> set, to select no-idle
> workload first only when entering the root group, thus a slight
> penalization of the
> async workload.

Also penalization of sync-idle workload in root group.

The higher latencies for sync-noidle workload with-in a group will be
observed only if group_isolation=1 and if group has low weight. I think
in general this problem should be solved by bumping up the weight of the
group, otherwise just live with it to ensure fairness for sync-idle
workload in the group.

With group_isolation=0, the problem should be less visible as root group
runs with max weight in the system (1000). In general I will assume that
one will choose system weights in such a manner so that root group does
not starve.

> Gui, if you observed improvements with this patch, probably you can obtain them
> without the starvation drawback by making it conditional to !group_isolation.
>
> BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
> can have just one argument, called e.g. boost_no_idle, and you pass
> (!group_isolation && cfqg_changed) || prio_changed.
>

Because it will impact the share of sync-idle workload in root group and
asyn workload systemwide, I am not too keen on doing this change. I would
rather rely on root group being higher weight.

So far this reset of workload order happens only if we decide to higher
prio class task. So if there is some RT activity happening in system, we
will constantly be resetting the workload order for BE class and keep on
servicing sync-noidle workload while starving sync-idle and async
workload.

So it is probably still ok for priority class switching, but I am not too keen
on doing this at group switching event. This event will be too frequent if
there are significant number of groups in the system and we don't want to
starve root group sync-idle and system wide async workload.

If somebody is not happy with latecies of sync-noidle workload, then
easier way to solve that issue is readjust weights of groups.

Thanks
Vivek

> > In fact we don't want to do that and we
> > want to start from where we left so that no workload with-in group is
> > starved.
>
> Agreed.
>
> Thanks,
> Corrado
>
> >
> > When a group is resumed, choose_service_tree() always checks if any RT
> > queue got backlogged or not. If yes, it does choose that first.
> >
> > Can you give more details why do you want to do this change?
> >
> > Thanks
> > Vivek
> >
> >> Signed-off-by: Gui Jianfeng <[email protected]>
> >> ---
> >> ?block/cfq-iosched.c | ? 27 ++++++++++++++++++---------
> >> ?1 files changed, 18 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index f3f6239..16084ca 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -1964,7 +1964,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> >>
> >> ?static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cfq_group *cfqg, enum wl_prio_t prio,
> >> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool prio_changed)
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool cfqg_changed, bool prio_changed)
> >> ?{
> >> ? ? ? struct cfq_queue *queue;
> >> ? ? ? int i;
> >> @@ -1972,7 +1972,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> ? ? ? unsigned long lowest_key = 0;
> >> ? ? ? enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
> >>
> >> - ? ? if (prio_changed) {
> >> + ? ? if (cfqg_changed || prio_changed) {
> >> ? ? ? ? ? ? ? /*
> >> ? ? ? ? ? ? ? ?* When priorities switched, we prefer starting
> >> ? ? ? ? ? ? ? ?* from SYNC_NOIDLE (first choice), or just SYNC
> >> @@ -2001,7 +2001,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> ? ? ? return cur_best;
> >> ?}
> >>
> >> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> +static void choose_service_tree(struct cfq_data *cfqd,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cfq_group *cfqg, bool cfqg_changed)
> >> ?{
> >> ? ? ? enum wl_prio_t previous_prio = cfqd->serving_prio;
> >> ? ? ? bool prio_changed;
> >> @@ -2033,21 +2034,22 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> ? ? ? ?* expiration time
> >> ? ? ? ?*/
> >> ? ? ? prio_changed = (cfqd->serving_prio != previous_prio);
> >> - ? ? st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> >> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfqd);
> >> + ? ? st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? cfqd);
> >> ? ? ? count = st->count;
> >>
> >> ? ? ? /*
> >> ? ? ? ?* If priority didn't change, check workload expiration,
> >> ? ? ? ?* and that we still have other queues ready
> >> ? ? ? ?*/
> >> - ? ? if (!prio_changed && count &&
> >> + ? ? if (!cfqg_changed && !prio_changed && count &&
> >> ? ? ? ? ? !time_after(jiffies, cfqd->workload_expires))
> >> ? ? ? ? ? ? ? return;
> >>
> >> ? ? ? /* otherwise select new workload type */
> >> ? ? ? cfqd->serving_type =
> >> - ? ? ? ? ? ? cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
> >> + ? ? ? ? ? ? cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? cfqg_changed, prio_changed);
> >> ? ? ? st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfqd);
> >> ? ? ? count = st->count;
> >> @@ -2104,9 +2106,16 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
> >>
> >> ?static void cfq_choose_cfqg(struct cfq_data *cfqd)
> >> ?{
> >> + ? ? bool cfqg_changed = false;
> >> +
> >> + ? ? struct cfq_group *orig_cfqg = cfqd->serving_group;
> >> +
> >> ? ? ? struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
> >>
> >> - ? ? cfqd->serving_group = cfqg;
> >> + ? ? if (orig_cfqg != cfqg) {
> >> + ? ? ? ? ? ? cfqg_changed = 1;
> >> + ? ? ? ? ? ? cfqd->serving_group = cfqg;
> >> + ? ? }
> >>
> >> ? ? ? /* Restore the workload type data */
> >> ? ? ? if (cfqg->saved_workload_slice) {
> >> @@ -2114,7 +2123,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> >> ? ? ? ? ? ? ? cfqd->serving_type = cfqg->saved_workload;
> >> ? ? ? ? ? ? ? cfqd->serving_prio = cfqg->saved_serving_prio;
> >> ? ? ? }
> >> - ? ? choose_service_tree(cfqd, cfqg);
> >> + ? ? choose_service_tree(cfqd, cfqg, cfqg_changed);
> >> ?}
> >>
> >> ?/*
> >> --
> >> 1.5.4.rc3
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at ?http://www.tux.org/lkml/
> >
>
>
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
> Tales of Power - C. Castaneda

2009-12-14 02:42:08

by Gui, Jianfeng/归剑峰

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Vivek Goyal wrote:
> On Fri, Dec 11, 2009 at 07:01:14PM +0100, Corrado Zoccolo wrote:
>> Hi guys,
>> On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
>>> On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
>>>> Currently, with IO Controller introduced, CFQ chooses cfq group
>>>> at the top, and then choose service tree. So we need to take
>>>> whether cfq group is changed into account to decide whether we
>>>> should choose service tree start from scratch.
>>>>
>>> I am not able to understand the need/purpose of this patch. Once we
>>> switched the group during scheduling, why should we reset the order
>>> of workload with-in group.
>> I understand it, and in fact I was thinking about this.
>> The idea is the same as with priorities. If we have not serviced a group
>> for a while, we want to start with the no-idle workload to reduce its latency.
>>
>> Unfortunately, a group may have a too small share, that could cause some
>> workloads to be starved, as Vivek correctly points out, and we should
>> avoid this.
>> It should be easily reproducible testing a "many groups with mixed workloads"
>> scenario with group_isolation=1.
>>
>> Moreover, even if the approach is groups on top, when group isolation
>> is not set (i.e. the default), in the non-root groups you will only
>> have the sync-idle
>> queues, so it is much more similar (logically) to a workload on top
>> than it appears
>> from the code.
>>
>> I think the net result of this patch is, when group isolation is not
>> set, to select no-idle
>> workload first only when entering the root group, thus a slight
>> penalization of the
>> async workload.
>
> Also penalization of sync-idle workload in root group.
>
> The higher latencies for sync-noidle workload with-in a group will be
> observed only if group_isolation=1 and if group has low weight. I think
> in general this problem should be solved by bumping up the weight of the
> group, otherwise just live with it to ensure fairness for sync-idle
> workload in the group.
>
> With group_isolation=0, the problem should be less visible as root group
> runs with max weight in the system (1000). In general I will assume that
> one will choose system weights in such a manner so that root group does
> not starve.
>
>
>> Gui, if you observed improvements with this patch, probably you can obtain them
>> without the starvation drawback by making it conditional to !group_isolation.
>>
>> BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
>> can have just one argument, called e.g. boost_no_idle, and you pass
>> (!group_isolation && cfqg_changed) || prio_changed.
>>
>
> Because it will impact the share of sync-idle workload in root group and
> asyn workload systemwide, I am not too keen on doing this change. I would
> rather rely on root group being higher weight.
>
> So far this reset of workload order happens only if we decide to higher
> prio class task. So if there is some RT activity happening in system, we
> will constantly be resetting the workload order for BE class and keep on
> servicing sync-noidle workload while starving sync-idle and async
> workload.
>
> So it is probably still ok for priority class switching, but I am not too keen
> on doing this at group switching event. This event will be too frequent if
> there are significant number of groups in the system and we don't want to
> starve root group sync-idle and system wide async workload.
>
> If somebody is not happy with latecies of sync-noidle workload, then
> easier way to solve that issue is readjust weights of groups.
>
> Thanks
> Vivek

Hi Vivek, Corrado

Thanks for Corrado's explanation, i have the same concern.

Consider the following code,

prio_changed = (cfqd->serving_prio != previous_prio);
st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
cfqd);
count = st->count;
/*
* If priority didn't change, check workload expiration,
* and that we still have other queues ready
*/
if (!prio_changed && count &&
!time_after(jiffies, cfqd->workload_expires))
return;

One more thing i do this change is that currently if serving_prio isn't changed,
we'll start from where we left. But with io group introduced, cfqd->serving_prio and
previous_prio might come from different groups. So i don't think "prio_changed" still
makes sense in the case of group changing. cfqd->workload_expires also might come from
the workload from another group when deciding whether starting from "cfqd->serving_type".

Thanks,
Gui

2009-12-14 08:39:35

by Corrado Zoccolo

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Hi,
On Mon, Dec 14, 2009 at 3:37 AM, Gui Jianfeng
<[email protected]> wrote:
> Vivek Goyal wrote:
>> On Fri, Dec 11, 2009 at 07:01:14PM +0100, Corrado Zoccolo wrote:
>>> Hi guys,
>>> On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
>>>> On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
>>>>> Currently, with IO Controller introduced, CFQ chooses cfq group
>>>>> at the top, and then choose service tree. So we need to take
>>>>> whether cfq group is changed into account to decide whether we
>>>>> should choose service tree start from scratch.
>>>>>
>>>> I am not able to understand the need/purpose of this patch. Once we
>>>> switched the group during scheduling, why should we reset the order
>>>> of workload with-in group.
>>> I understand it, and in fact I was thinking about this.
>>> The idea is the same as with priorities. If we have not serviced a group
>>> for a while, we want to start with the no-idle workload to reduce its latency.
>>>
>>> Unfortunately, a group may have a too small share, that could cause some
>>> workloads to be starved, as Vivek correctly points out, and we should
>>> avoid this.
>>> It should be easily reproducible testing a "many groups with mixed workloads"
>>> scenario with group_isolation=1.
>>>
>>> Moreover, even if the approach is groups on top, when group isolation
>>> is not set (i.e. the default), in the non-root groups you will only
>>> have the sync-idle
>>> queues, so it is much more similar (logically) to a workload on top
>>> than it appears
>>> from the code.
>>>
>>> I think the net result of this patch is, when group isolation is not
>>> set, to select no-idle
>>> workload first only when entering the root group, thus a slight
>>> penalization of the
>>> async workload.
>>
>> Also penalization of sync-idle workload in root group.
>>
>> The higher latencies for sync-noidle workload with-in a group will be
>> observed only if group_isolation=1 and if group has low weight. I think
>> in general this problem should be solved by bumping up the weight of the
>> group, otherwise just live with it to ensure fairness for sync-idle
>> workload in the group.
>>
>> With group_isolation=0, the problem should be less visible as root group
>> runs with max weight in the system (1000). In general I will assume that
>> one will choose system weights in such a manner so that root group does
>> not starve.
>>
>>
>>> Gui, if you observed improvements with this patch, probably you can obtain them
>>> without the starvation drawback by making it conditional to !group_isolation.
>>>
>>> BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
>>> can have just one argument, called e.g. boost_no_idle, and you pass
>>> (!group_isolation && cfqg_changed) || prio_changed.
>>>
>>
>> Because it will impact the share of sync-idle workload in root group and
>> asyn workload systemwide, I am not too keen on doing this change. I would
>> rather rely on root group being higher weight.
>>
>> So far this reset of workload order happens only if we decide to higher
>> prio class task. So if there is some RT activity happening in system, we
>> will constantly be resetting the workload order for BE class and keep on
>> servicing sync-noidle workload while starving sync-idle and async
>> workload.
>>
>> So it is probably still ok for priority class switching, but I am not too keen
>> on doing this at group switching event. This event will be too frequent if
>> there are significant number of groups in the system and we don't want to
>> starve root group sync-idle and system wide async workload.
>>
>> If somebody is not happy with latecies of sync-noidle workload, then
>> easier way to solve that issue is readjust weights of groups.
>>
>> Thanks
>> Vivek
>
> Hi Vivek, Corrado
>
> Thanks for Corrado's explanation, i have the same concern.
>
> Consider the following code,
>
> prio_changed = (cfqd->serving_prio != previous_prio);
> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> cfqd);
> count = st->count;
> /*
> * If priority didn't change, check workload expiration,
> * and that we still have other queues ready
> */
> if (!prio_changed && count &&
> !time_after(jiffies, cfqd->workload_expires))
> return;
>
> One more thing i do this change is that currently if serving_prio isn't changed,
> we'll start from where we left. But with io group introduced, cfqd->serving_prio and
> previous_prio might come from different groups. So i don't think "prio_changed" still
> makes sense in the case of group changing.
Probably it is better to just remove the priority change concept. The
code will be simpler, and the impact should be very small.
> cfqd->workload_expires also might come from
> the workload from another group when deciding whether starting from "cfqd->serving_type".
This shouldn't happen, since Vivek is saving&restoring the workload
parameters at each group change. So it is likely that the workload
will be expired here, and we compute the new workload when entering
the new group.

I have one more concern, though.
RT priority has now changed meaning. Before, an RT task would always
have priority access to the disk. Now, a BE task in a different group,
with lower weight, can steal the disk from the RT task.
A way to preserve the old meaning is to consider wheter a group has RT
tasks inside when sorting groups tree, and putting those groups at the
front.
Usually, RT tasks will be put in the root group, and this (if
group_isolation=0) will automatically make sure that also the noidle
workload gets serviced quickly after RT tasks release the disk. We
could even enforce that, with group_isolation=0, all RT tasks are put
in the root group.

The rationale behind this suggestion is that groups are for user
processes, while RT is system wide, since it is only root that can
grant it.

Thanks
Corrado
>
> Thanks,
> Gui
>
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-12-14 09:59:29

by Gui, Jianfeng/归剑峰

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Corrado Zoccolo wrote:
> Hi,
> On Mon, Dec 14, 2009 at 3:37 AM, Gui Jianfeng
> <[email protected]> wrote:
>> Vivek Goyal wrote:
>>> On Fri, Dec 11, 2009 at 07:01:14PM +0100, Corrado Zoccolo wrote:
>>>> Hi guys,
>>>> On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
>>>>> On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
>>>>>> Currently, with IO Controller introduced, CFQ chooses cfq group
>>>>>> at the top, and then choose service tree. So we need to take
>>>>>> whether cfq group is changed into account to decide whether we
>>>>>> should choose service tree start from scratch.
>>>>>>
>>>>> I am not able to understand the need/purpose of this patch. Once we
>>>>> switched the group during scheduling, why should we reset the order
>>>>> of workload with-in group.
>>>> I understand it, and in fact I was thinking about this.
>>>> The idea is the same as with priorities. If we have not serviced a group
>>>> for a while, we want to start with the no-idle workload to reduce its latency.
>>>>
>>>> Unfortunately, a group may have a too small share, that could cause some
>>>> workloads to be starved, as Vivek correctly points out, and we should
>>>> avoid this.
>>>> It should be easily reproducible testing a "many groups with mixed workloads"
>>>> scenario with group_isolation=1.
>>>>
>>>> Moreover, even if the approach is groups on top, when group isolation
>>>> is not set (i.e. the default), in the non-root groups you will only
>>>> have the sync-idle
>>>> queues, so it is much more similar (logically) to a workload on top
>>>> than it appears
>>>> from the code.
>>>>
>>>> I think the net result of this patch is, when group isolation is not
>>>> set, to select no-idle
>>>> workload first only when entering the root group, thus a slight
>>>> penalization of the
>>>> async workload.
>>> Also penalization of sync-idle workload in root group.
>>>
>>> The higher latencies for sync-noidle workload with-in a group will be
>>> observed only if group_isolation=1 and if group has low weight. I think
>>> in general this problem should be solved by bumping up the weight of the
>>> group, otherwise just live with it to ensure fairness for sync-idle
>>> workload in the group.
>>>
>>> With group_isolation=0, the problem should be less visible as root group
>>> runs with max weight in the system (1000). In general I will assume that
>>> one will choose system weights in such a manner so that root group does
>>> not starve.
>>>
>>>
>>>> Gui, if you observed improvements with this patch, probably you can obtain them
>>>> without the starvation drawback by making it conditional to !group_isolation.
>>>>
>>>> BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
>>>> can have just one argument, called e.g. boost_no_idle, and you pass
>>>> (!group_isolation && cfqg_changed) || prio_changed.
>>>>
>>> Because it will impact the share of sync-idle workload in root group and
>>> asyn workload systemwide, I am not too keen on doing this change. I would
>>> rather rely on root group being higher weight.
>>>
>>> So far this reset of workload order happens only if we decide to higher
>>> prio class task. So if there is some RT activity happening in system, we
>>> will constantly be resetting the workload order for BE class and keep on
>>> servicing sync-noidle workload while starving sync-idle and async
>>> workload.
>>>
>>> So it is probably still ok for priority class switching, but I am not too keen
>>> on doing this at group switching event. This event will be too frequent if
>>> there are significant number of groups in the system and we don't want to
>>> starve root group sync-idle and system wide async workload.
>>>
>>> If somebody is not happy with latecies of sync-noidle workload, then
>>> easier way to solve that issue is readjust weights of groups.
>>>
>>> Thanks
>>> Vivek
>> Hi Vivek, Corrado
>>
>> Thanks for Corrado's explanation, i have the same concern.
>>
>> Consider the following code,
>>
>> prio_changed = (cfqd->serving_prio != previous_prio);
>> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>> cfqd);
>> count = st->count;
>> /*
>> * If priority didn't change, check workload expiration,
>> * and that we still have other queues ready
>> */
>> if (!prio_changed && count &&
>> !time_after(jiffies, cfqd->workload_expires))
>> return;
>>
>> One more thing i do this change is that currently if serving_prio isn't changed,
>> we'll start from where we left. But with io group introduced, cfqd->serving_prio and
>> previous_prio might come from different groups. So i don't think "prio_changed" still
>> makes sense in the case of group changing.
> Probably it is better to just remove the priority change concept. The
> code will be simpler, and the impact should be very small.
>> cfqd->workload_expires also might come from
>> the workload from another group when deciding whether starting from "cfqd->serving_type".
> This shouldn't happen, since Vivek is saving&restoring the workload
> parameters at each group change. So it is likely that the workload
> will be expired here, and we compute the new workload when entering
> the new group.

Currently, IIUC, only the workload that didn't use up its slice will be saved, and only
such workloads are restoring when a group is resumed. So sometimes, we'll still get the
previous serving_type and workload_expires. Am i missing something?

>
> I have one more concern, though.
> RT priority has now changed meaning. Before, an RT task would always
> have priority access to the disk. Now, a BE task in a different group,
> with lower weight, can steal the disk from the RT task.
> A way to preserve the old meaning is to consider wheter a group has RT
> tasks inside when sorting groups tree, and putting those groups at the
> front.
> Usually, RT tasks will be put in the root group, and this (if
> group_isolation=0) will automatically make sure that also the noidle
> workload gets serviced quickly after RT tasks release the disk. We
> could even enforce that, with group_isolation=0, all RT tasks are put
> in the root group.
>
> The rationale behind this suggestion is that groups are for user
> processes, while RT is system wide, since it is only root that can
> grant it.

I agree, and one more thing, currently we can't see fairness between different
idle tasks in different groups. Because we only allow idle cfqq dispatch one request
for its dispatch round even if it's the only task in the cgroup, group always loose it
share. So whether we can rely on group_isolation, when group_isolation == 1 we provide
isolation for idle tasks.

Thanks
Gui

2009-12-14 11:01:19

by Corrado Zoccolo

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Hi Gui,
On Mon, Dec 14, 2009 at 10:54 AM, Gui Jianfeng
<[email protected]> wrote:
> Corrado Zoccolo wrote:
>> Hi,
> Currently, IIUC, only the workload that didn't use up its slice will be saved, and only
> such workloads are restoring when a group is resumed. So sometimes, we'll still get the
> previous serving_type and workload_expires. Am i missing something?
You are right. cfq_choose_cfqg should set the workload as expired if
!cfqg->saved_workload_slice (just set cfqd->workload_expires = jiffies
- 1), so the workload will be chosen again as the lowest keyed one.
Can you send a patch to fix this?
>
>
>>
>> I have one more concern, though.
>> RT priority has now changed meaning. Before, an RT task would always
>> have priority access to the disk. Now, a BE task in a different group,
>> with lower weight, can steal the disk from the RT task.
>> A way to preserve the old meaning is to consider wheter a group has RT
>> tasks inside when sorting groups tree, and putting those groups at the
>> front.
>> Usually, RT tasks will be put in the root group, and this (if
>> group_isolation=0) will automatically make sure that also the noidle
>> workload gets serviced quickly after RT tasks release the disk. We
>> could even enforce that, with group_isolation=0, all RT tasks are put
>> in the root group.
>>
>> The rationale behind this suggestion is that groups are for user
>> processes, while RT is system wide, since it is only root that can
>> grant it.
>
> I agree, and one more thing, currently we can't see fairness between different
> idle tasks in different groups. Because we only allow idle cfqq dispatch one request
> for its dispatch round even if it's the only task in the cgroup, group always loose it
> share. So whether we can rely on group_isolation, when group_isolation == 1 we provide
> isolation for idle tasks.
Agreed.

Thanks,
Corrado

>
> Thanks
> Gui
>
>
>

2009-12-15 00:52:30

by Vivek Goyal

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

On Mon, Dec 14, 2009 at 12:01:14PM +0100, Corrado Zoccolo wrote:
> Hi Gui,
> On Mon, Dec 14, 2009 at 10:54 AM, Gui Jianfeng
> <[email protected]> wrote:
> > Corrado Zoccolo wrote:
> >> Hi,
> > Currently, IIUC, only the workload that didn't use up its slice will be saved, and only
> > such workloads are restoring when a group is resumed. So sometimes, we'll still get the
> > previous serving_type and workload_expires. Am i missing something?
> You are right. cfq_choose_cfqg should set the workload as expired if
> !cfqg->saved_workload_slice (just set cfqd->workload_expires = jiffies
> - 1), so the workload will be chosen again as the lowest keyed one.
> Can you send a patch to fix this?
> >

I agree too. This is a bug and needs to be fixed. Declaring workload
expired if we did not save a state, sounds good. We will be forced to
choose workload with-in group again and that is the right thing to do.

Gui, the point about prio_changed across group also will be solved by
above fix. We do save/restore the state of prio being served with-in
group. The only case missed out was, what if workload slice has expired
at the time of group expiry. So above patch should fix that too.

> >
> >>
> >> I have one more concern, though.
> >> RT priority has now changed meaning. Before, an RT task would always
> >> have priority access to the disk. Now, a BE task in a different group,
> >> with lower weight, can steal the disk from the RT task.
> >> A way to preserve the old meaning is to consider wheter a group has RT
> >> tasks inside when sorting groups tree, and putting those groups at the
> >> front.
> >> Usually, RT tasks will be put in the root group, and this (if
> >> group_isolation=0) will automatically make sure that also the noidle
> >> workload gets serviced quickly after RT tasks release the disk. We
> >> could even enforce that, with group_isolation=0, all RT tasks are put
> >> in the root group.
> >>
> >> The rationale behind this suggestion is that groups are for user
> >> processes, while RT is system wide, since it is only root that can
> >> grant it.

This sounds reasonable. I was also thinking of having a separate group
service tree for RT groups. That will allow root to create RT groups
of different weights and achieve proportionate share between RT groups.
But I am not sure how useful that actually is.

So to preserve the meaning of original RT tasks systemwide, a simple
solution is to determine if a group as RT tasks and if answer is yes, then
put it at the front of service tree or at the back of currently backlogged
RT groups.

I will play a bit with this and come up with a patch.

> >
> > ?I agree, and one more thing, currently we can't see fairness between different
> > ?idle tasks in different groups. Because we only allow idle cfqq dispatch one request
> > ?for its dispatch round even if it's the only task in the cgroup, group always loose it
> > ?share. So whether we can rely on group_isolation, when group_isolation == 1 we provide
> > ?isolation for idle tasks.
> Agreed.

What's the practical usage of this? In theory it sounds fine that groups
should provide isolation for idle task also. But what's the use case for
this. In the first phase, the way RT class is system wide, why idle class
can't be more or less system wide.

So I don't mind putting a patch in for making sure idle class task in a
group does not loose share. But I would rather put that patch once
somebody really needs it. Are you being impacted with this in your
workflow? If yes, please do post a patch to fix it.

Thanks
Vivek

2009-12-15 01:11:13

by Gui, Jianfeng/归剑峰

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Corrado Zoccolo wrote:
> Hi Gui,
> On Mon, Dec 14, 2009 at 10:54 AM, Gui Jianfeng
> <[email protected]> wrote:
>> Corrado Zoccolo wrote:
>>> Hi,
>> Currently, IIUC, only the workload that didn't use up its slice will be saved, and only
>> such workloads are restoring when a group is resumed. So sometimes, we'll still get the
>> previous serving_type and workload_expires. Am i missing something?
> You are right. cfq_choose_cfqg should set the workload as expired if
> !cfqg->saved_workload_slice (just set cfqd->workload_expires = jiffies
> - 1), so the workload will be chosen again as the lowest keyed one.
> Can you send a patch to fix this?

Will do.

Thanks
Gui

>>
>>> I have one more concern, though.
>>> RT priority has now changed meaning. Before, an RT task would always
>>> have priority access to the disk. Now, a BE task in a different group,
>>> with lower weight, can steal the disk from the RT task.
>>> A way to preserve the old meaning is to consider wheter a group has RT
>>> tasks inside when sorting groups tree, and putting those groups at the
>>> front.
>>> Usually, RT tasks will be put in the root group, and this (if
>>> group_isolation=0) will automatically make sure that also the noidle
>>> workload gets serviced quickly after RT tasks release the disk. We
>>> could even enforce that, with group_isolation=0, all RT tasks are put
>>> in the root group.
>>>
>>> The rationale behind this suggestion is that groups are for user
>>> processes, while RT is system wide, since it is only root that can
>>> grant it.
>> I agree, and one more thing, currently we can't see fairness between different
>> idle tasks in different groups. Because we only allow idle cfqq dispatch one request
>> for its dispatch round even if it's the only task in the cgroup, group always loose it
>> share. So whether we can rely on group_isolation, when group_isolation == 1 we provide
>> isolation for idle tasks.
> Agreed.
>
> Thanks,
> Corrado
>
>> Thanks
>> Gui
>>
>>
>>
>
>
>

--
Regards
Gui Jianfeng

2009-12-15 15:23:28

by Vivek Goyal

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

On Mon, Dec 14, 2009 at 09:39:32AM +0100, Corrado Zoccolo wrote:
> Hi,
> On Mon, Dec 14, 2009 at 3:37 AM, Gui Jianfeng
> <[email protected]> wrote:
> > Vivek Goyal wrote:
> >> On Fri, Dec 11, 2009 at 07:01:14PM +0100, Corrado Zoccolo wrote:
> >>> Hi guys,
> >>> On Fri, Dec 11, 2009 at 4:07 PM, Vivek Goyal <[email protected]> wrote:
> >>>> On Fri, Dec 11, 2009 at 01:02:10PM +0800, Gui Jianfeng wrote:
> >>>>> Currently, with IO Controller introduced, CFQ chooses cfq group
> >>>>> at the top, and then choose service tree. So we need to take
> >>>>> whether cfq group is changed into account to decide whether we
> >>>>> should choose service tree start from scratch.
> >>>>>
> >>>> I am not able to understand the need/purpose of this patch. Once we
> >>>> switched the group during scheduling, why should we reset the order
> >>>> of workload with-in group.
> >>> I understand it, and in fact I was thinking about this.
> >>> The idea is the same as with priorities. If we have not serviced a group
> >>> for a while, we want to start with the no-idle workload to reduce its latency.
> >>>
> >>> Unfortunately, a group may have a too small share, that could cause some
> >>> workloads to be starved, as Vivek correctly points out, and we should
> >>> avoid this.
> >>> It should be easily reproducible testing a "many groups with mixed workloads"
> >>> scenario with group_isolation=1.
> >>>
> >>> Moreover, even if the approach is groups on top, when group isolation
> >>> is not set (i.e. the default), in the non-root groups you will only
> >>> have the sync-idle
> >>> queues, so it is much more similar (logically) to a workload on top
> >>> than it appears
> >>> from the code.
> >>>
> >>> I think the net result of this patch is, when group isolation is not
> >>> set, to select no-idle
> >>> workload first only when entering the root group, thus a slight
> >>> penalization of the
> >>> async workload.
> >>
> >> Also penalization of sync-idle workload in root group.
> >>
> >> The higher latencies for sync-noidle workload with-in a group will be
> >> observed only if group_isolation=1 and if group has low weight. I think
> >> in general this problem should be solved by bumping up the weight of the
> >> group, otherwise just live with it to ensure fairness for sync-idle
> >> workload in the group.
> >>
> >> With group_isolation=0, the problem should be less visible as root group
> >> runs with max weight in the system (1000). In general I will assume that
> >> one will choose system weights in such a manner so that root group does
> >> not starve.
> >>
> >>
> >>> Gui, if you observed improvements with this patch, probably you can obtain them
> >>> without the starvation drawback by making it conditional to !group_isolation.
> >>>
> >>> BTW, since you always use cfqg_changed and prio_changed OR-ed together, you
> >>> can have just one argument, called e.g. boost_no_idle, and you pass
> >>> (!group_isolation && cfqg_changed) || prio_changed.
> >>>
> >>
> >> Because it will impact the share of sync-idle workload in root group and
> >> asyn workload systemwide, I am not too keen on doing this change. I would
> >> rather rely on root group being higher weight.
> >>
> >> So far this reset of workload order happens only if we decide to higher
> >> prio class task. So if there is some RT activity happening in system, we
> >> will constantly be resetting the workload order for BE class and keep on
> >> servicing sync-noidle workload while starving sync-idle and async
> >> workload.
> >>
> >> So it is probably still ok for priority class switching, but I am not too keen
> >> on doing this at group switching event. This event will be too frequent if
> >> there are significant number of groups in the system and we don't want to
> >> starve root group sync-idle and system wide async workload.
> >>
> >> If somebody is not happy with latecies of sync-noidle workload, then
> >> easier way to solve that issue is readjust weights of groups.
> >>
> >> Thanks
> >> Vivek
> >
> > Hi Vivek, Corrado
> >
> > Thanks for Corrado's explanation, i have the same concern.
> >
> > Consider the following code,
> >
> > prio_changed = (cfqd->serving_prio != previous_prio);
> > st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> > ? ? ? ? ? ? ? ? ? ? ? ?cfqd);
> > count = st->count;
> > /*
> > ?* If priority didn't change, check workload expiration,
> > ?* and that we still have other queues ready
> > ?*/
> > if (!prio_changed && count &&
> > ? ? ?!time_after(jiffies, cfqd->workload_expires))
> > ? ? ?return;
> >
> > One more thing i do this change is that currently if serving_prio isn't changed,
> > we'll start from where we left. But with io group introduced, cfqd->serving_prio and
> > previous_prio might come from different groups. So i don't think "prio_changed" still
> > makes sense in the case of group changing.
> Probably it is better to just remove the priority change concept. The
> code will be simpler, and the impact should be very small.
> > cfqd->workload_expires also might come from
> > the workload from another group when deciding whether starting from "cfqd->serving_type".
> This shouldn't happen, since Vivek is saving&restoring the workload
> parameters at each group change. So it is likely that the workload
> will be expired here, and we compute the new workload when entering
> the new group.
>
> I have one more concern, though.
> RT priority has now changed meaning. Before, an RT task would always
> have priority access to the disk. Now, a BE task in a different group,
> with lower weight, can steal the disk from the RT task.
> A way to preserve the old meaning is to consider wheter a group has RT
> tasks inside when sorting groups tree, and putting those groups at the
> front.
> Usually, RT tasks will be put in the root group, and this (if
> group_isolation=0) will automatically make sure that also the noidle
> workload gets serviced quickly after RT tasks release the disk. We
> could even enforce that, with group_isolation=0, all RT tasks are put
> in the root group.
>
> The rationale behind this suggestion is that groups are for user
> processes, while RT is system wide, since it is only root that can
> grant it.
>

Thinking more about it...

Moving all RT tasks to root group will increase the overall share of root
group (including share of non RT workload like sync-idle, sync-noidle and
async). Because everytime, RT task does some IO, root group will be put at the
front of service tree (irrespective of the fact how much service it has
received in the past w.r.t other groups). That will make root group gain
share and in trun root group non-RT sync-idle, sync-nodile and async
workload also gain share.

Another way to solve the issue could be to have a separate service tree
and root group for RT workload. By default all the RT tasks (systemwide),
will be put into that group and we will always serve that root rt group first
and if that group does not have any request than serve the requests from
regular (BE and IDLE tasks), group service tree.

This will make sure that RT tasks system wide get full access to disk
first and then BE and IDLE tasks get to run. Also BE and IDLE tasks in
root group will not gain share.

One issue with this approach is prio_changed concept. Because now all the
RT tasks are in a seprate group altogether, there will be no concept of
prio_changed with-in group. Rest of the group will have either BE or IDLE
prio tasks only. So that would mean that I need to get rid of prio_changed
concept while selecting workload with-in group and rely on either fresh
selection of workload type based on rb_key offset or kind of force strict
round-robin between workloads of type (sync-idle, sync-noidle and async).

Does this make sense? Corrodo, do you forsee any issues if I get rid of
prio_changed concept. So if a workload has expired, we will always do
fresh selection of workload based on rb_key across service trees of
sync-idle, sync-noidle and async. This might lead to issues of sync-noidle
workload not gettting as good latency in the presence of RT tasks. May be
forcing a strict round robin between workload types will mitigate that
issue up to some extent.

Thanks
Vivek

2009-12-15 16:04:34

by Corrado Zoccolo

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

Hi Vivek,

On Tue, Dec 15, 2009 at 4:23 PM, Vivek Goyal <[email protected]> wrote:
>
> Thinking more about it...
>
> Moving all RT tasks to root group will increase the overall share of root
> group (including share of non RT workload like sync-idle, sync-noidle and
> async). Because everytime, RT task does some IO, root group will be put at the
> front of service tree (irrespective of the fact how much service it has
> received in the past w.r.t other groups). That will make root group gain
> share and in trun root group non-RT sync-idle, sync-nodile and async
> workload also gain share.
>
Yes, this can be a problem. However, it can obtain a similar effect to the
prio_changed concept (especially when group_isolation = 0). In fact, we wanted
to run sync_noidle after RT, and sync_noidle are in root group in the
uninsulated case.

> Another way to solve the issue could be to have a separate service tree
> and root group for RT workload. By default all the RT tasks (systemwide),
> will be put into that group and we will always serve that root rt group first
> and if that group does not have any request than serve the requests from
> regular (BE and IDLE tasks), group service tree.
Ok. In this way, in a group you won't have 2 arrays of service trees
(one for each priority), but just one of them.
Only the RT root group will be allowed to contain RT queues.
What about IDLE? Should we have an other root group for it, to have
system-wide idle?
In that case, the design will become more orthogonal.

>
> This will make sure that RT tasks system wide get full access to disk
> first and then BE and IDLE tasks get to run. Also BE and IDLE tasks in
> root group will not gain share.
>
> One issue with this approach is prio_changed concept. Because now all the
> RT tasks are in a seprate group altogether, there will be no concept of
> prio_changed with-in group. Rest of the group will have either BE or IDLE
> prio tasks only. So that would mean that I need to get rid of prio_changed
> concept while selecting workload with-in group and rely on either fresh
> selection of workload type based on rb_key offset or kind of force strict
> round-robin between workloads of type (sync-idle, sync-noidle and async).
>
> Does this make sense? Corrodo, do you forsee any issues if I get rid of
> prio_changed concept. So if a workload has expired, we will always do
> fresh selection of workload based on rb_key across service trees of
> sync-idle, sync-noidle and async. This might lead to issues of sync-noidle
> workload not gettting as good latency in the presence of RT tasks. May be
> forcing a strict round robin between workload types will mitigate that
> issue up to some extent.
>
I think the prio_changed concept can simply be dropped, and we can still have
lowest rb_key selection (that I think is superior to strict round robin).
I don't see any problem in this. In presence of RT, the latency will
go up anyway.

Thanks,
Corrado

> Thanks
> Vivek
>

2009-12-15 16:22:43

by Vivek Goyal

[permalink] [raw]

Subject: Re: [PATCH] cfq: Take whether cfq group is changed into account when choosing service tree

On Tue, Dec 15, 2009 at 05:04:31PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
>
> On Tue, Dec 15, 2009 at 4:23 PM, Vivek Goyal <[email protected]> wrote:
> >
> > Thinking more about it...
> >
> > Moving all RT tasks to root group will increase the overall share of root
> > group (including share of non RT workload like sync-idle, sync-noidle and
> > async). Because everytime, RT task does some IO, root group will be put at the
> > front of service tree (irrespective of the fact how much service it has
> > received in the past w.r.t other groups). That will make root group gain
> > share and in trun root group non-RT sync-idle, sync-nodile and async
> > workload also gain share.
> >
> Yes, this can be a problem. However, it can obtain a similar effect to the
> prio_changed concept (especially when group_isolation = 0). In fact, we wanted
> to run sync_noidle after RT, and sync_noidle are in root group in the
> uninsulated case.
>
> > Another way to solve the issue could be to have a separate service tree
> > and root group for RT workload. By default all the RT tasks (systemwide),
> > will be put into that group and we will always serve that root rt group first
> > and if that group does not have any request than serve the requests from
> > regular (BE and IDLE tasks), group service tree.
> Ok. In this way, in a group you won't have 2 arrays of service trees
> (one for each priority), but just one of them.
> Only the RT root group will be allowed to contain RT queues.

Yes, something like that. So instead of a group hosting both RT and BE
queues, group can have a type and it will host only either RT or BE
queues.

> What about IDLE? Should we have an other root group for it, to have
> system-wide idle?

My be. We can make a system wide idle group also along the lines of system
wide RT group.

So in this model, we will be doing proportional BW division only for BE
class tasks and not for RT and BE class tasks. RT and BE class tasks will
be system wide and will always go to fixed root RT or root BE groups. BE
class tasks will go in different groups based on their cgroups and will
get disk share in proportion to group weight.

> In that case, the design will become more orthogonal.
>
> >
> > This will make sure that RT tasks system wide get full access to disk
> > first and then BE and IDLE tasks get to run. Also BE and IDLE tasks in
> > root group will not gain share.
> >
> > One issue with this approach is prio_changed concept. Because now all the
> > RT tasks are in a seprate group altogether, there will be no concept of
> > prio_changed with-in group. Rest of the group will have either BE or IDLE
> > prio tasks only. So that would mean that I need to get rid of prio_changed
> > concept while selecting workload with-in group and rely on either fresh
> > selection of workload type based on rb_key offset or kind of force strict
> > round-robin between workloads of type (sync-idle, sync-noidle and async).
> >
> > Does this make sense? Corrodo, do you forsee any issues if I get rid of
> > prio_changed concept. So if a workload has expired, we will always do
> > fresh selection of workload based on rb_key across service trees of
> > sync-idle, sync-noidle and async. This might lead to issues of sync-noidle
> > workload not gettting as good latency in the presence of RT tasks. May be
> > forcing a strict round robin between workload types will mitigate that
> > issue up to some extent.
> >
> I think the prio_changed concept can simply be dropped, and we can still have
> lowest rb_key selection (that I think is superior to strict round robin).
> I don't see any problem in this. In presence of RT, the latency will
> go up anyway.

Ok, initially we can stick to lowest rb_key based selection and if that
does not give satisfactory latencies for sync-noidle workload, we can
revisit this issue.

Cool, I will write a patch and see how well does this thing work.

Thanks
Vivek