2014-06-10 08:56:25

by Michael wang

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
>
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.
>

I think we have finally located the reason why cpu-cgroup doesn't work
well with dbench.

Since quite some time has passed, here is a link to the way to reproduce
the issue:

https://lkml.org/lkml/2014/5/16/4

Now here is the analysis:

So our problem is: when tasks like dbench, which sleep and wake each
other frequently, are put into a deep group, they gather on the same CPU
while a workload like stress is running, which means the whole group can
gain no more than one CPU.

Basically there are two key points here, load-balance and wake-affine.

Wake-affine certainly pulls tasks together for a workload like dbench;
what makes the difference when dbench is put into a group one level
deeper is load balancing, which happens less.

Usually, when the system is busy and we cannot locate an idle cpu during
wakeup, we pick the search point instead, however busy it is, since we
count on the balance routine to even out the load later.
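To illustrate that fallback, here is a tiny stand-alone sketch (not the
kernel code; the cpu count, the nr_running numbers and the function name
are made up for the example):

/* Toy model of the wakeup placement described above: scan for an idle
 * cpu; if none is found, fall back to the search point (the curr cpu
 * when wake-affine won, the prev cpu otherwise) and leave the rest to
 * the later balance routine. */
#include <stdio.h>

#define NR_CPUS 4

static int nr_running[NR_CPUS] = { 3, 2, 4, 1 };        /* all cpus busy */

static int select_idle_sibling_sketch(int target)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (nr_running[cpu] == 0)
                        return cpu;     /* found an idle cpu */

        return target;  /* nothing idle: pick the search point anyway */
}

int main(void)
{
        int curr = 0, prev = 2;
        int wake_affine = 1;    /* pretend wake-affine pulled the wakee */
        int search = wake_affine ? curr : prev;
        int cpu = select_idle_sibling_sketch(search);

        printf("search point cpu%d -> placed on cpu%d (nr_running %d)\n",
               search, cpu, nr_running[cpu]);
        return 0;
}

With every cpu busy the wakee simply lands on the search point, which is
what happens to dbench in the bad case.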

However, in our case load balancing cannot help with that, since the
deeper the group is, the less its load means to the root group.

Which means that even when the tasks of a deep group are all gathered on
one CPU, the load can still look balanced from the root group's view, and
the tasks lose their only chance (load balancing) to spread once they are
already on the same CPU...

Furthermore, for tasks that flip frequently like dbench, it becomes far
harder for load balancing to help; it can rarely even catch them on the rq.

So in such cases, the only chance to balance these tasks is during
wakeup, but doing that is expensive...

Thus the cheaper way is something just like select_idle_sibling(); the
only difference is that we now balance tasks inside the group to prevent
them from gathering.

The patch below has solved the problem during testing. I'd like to do
more testing on other benchmarks before sending out the formal patch; any
comments are welcome ;-)

Regards,
Michael Wang



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..e1381cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+	return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try and locate an idle CPU in the sched_domain from tg's view.
+ *
+ * Although gathering on the same CPU and spreading across CPUs may
+ * make no difference from the highest group's view, gathering will
+ * starve the tasks: even if they have enough shares to fight for
+ * CPU, they only get one battlefield, which means that no matter
+ * how big their weight is, they get at most one CPU.
+ *
+ * Thus when the system is busy, we filter out those tasks which
+ * cannot gain help from the balance routine, and try to balance
+ * them internally with this function, so they stand a chance to
+ * show their power.
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int i = task_cpu(p);
+	struct task_group *tg = task_group(p);
+
+	if (tg_idle_cpu(tg, target))
+		goto done;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (i == target || !tg_idle_cpu(tg, i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+						   tsk_cpus_allowed(p));
+
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
+	}
+
+done:
+
+	return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	struct sched_entity *se = task_group(p)->se[i];
 
 	if (idle_cpu(target))
 		return target;
@@ -4451,6 +4508,30 @@ next:
 	} while (sg != sd->groups);
 }
 done:
+
+	if (!idle_cpu(target)) {
+		/*
+		 * Finding no idle cpu implies the system is somewhat
+		 * busy; usually we count on the load balance routine's
+		 * help and just pick the target however busy it is.
+		 *
+		 * However, when a task belongs to a deep group (harder
+		 * to make the root imbalanced) and flips frequently
+		 * (harder to be caught during balancing), the load
+		 * balance routine can help nothing, and these tasks
+		 * will eventually gather on the same cpu as they wake
+		 * each other up: the chance of gathering is far higher
+		 * than the chance of spreading.
+		 *
+		 * Thus we need to handle such tasks carefully during
+		 * wakeup, since it is their rare chance to spread.
+		 *
+		 */
+		if (se && se->depth &&
+		    p->wakee_flips > this_cpu_read(sd_llc_size))
+			return tg_idle_sibling(p, target);
+	}
+
 	return target;
 }


2014-06-10 12:12:30

by Peter Zijlstra

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On Tue, Jun 10, 2014 at 04:56:12PM +0800, Michael wang wrote:
> On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
> [snip]
> >
> > Hmm, that _should_ more or less work and does indeed suggest there's
> > something iffy.
> >
>
> I think we locate the reason why cpu-cgroup doesn't works well on dbench
> now... finally.
>
> I'd like to link the reproduce way of the issue here since long time
> passed...
>
> https://lkml.org/lkml/2014/5/16/4
>
> Now here is the analysis:
>
> So our problem is when put tasks like dbench which sleep and wakeup each other
> frequently into a deep-group, they will gathered on same CPU when workload like
> stress are running, which lead to that the whole group could gain no more than
> one CPU.
>
> Basically there are two key points here, load-balance and wake-affine.
>
> Wake-affine for sure pull tasks together for workload like dbench, what make
> it difference when put dbench into a group one level deeper is the
> load-balance, which happened less.

We load-balance less (frequently) or we migrate less tasks due to
load-balancing ?

> Usually, when system is busy, during the wakeup when we could not locate
> idle cpu, we pick the search point instead, whatever how busy it is since
> we count on the balance routine later to help balance the load.

But above you said that dbench usually triggers the wake-affine logic,
but now you say it doesn't and we rely on select_idle_sibling?

Note that the comparison isn't fair, running dbench on an idle system vs
running dbench on a busy system is the first step.

The second is adding the cgroup crap on.

> However, in our cases the load balance could not help on that, since deeper
> the group is, less the load effect it means to root group.

But since all actual load is on the same depth, the relative threshold
(imbalance pct) should work the same, the size of the values don't
matter, the relative ratios do.

> By which means even tasks in deep group all gathered on one CPU, the load
> could still balanced from the view of root group, and the tasks lost the
> only chances (balance) to spread when they already on the same CPU...

Sure, but see above.

> Furthermore, for tasks flip frequently like dbench, it'll become far more
> harder for load balance to help, it could even rarely catch them on rq.

And I suspect that is the main problem; so see what it does on a busy
system: !cgroup: nr_cpus busy loops + dbench, because that's your
benchmark for adding cgroups, the cgroup can only shift that behaviour
around.

> So in such cases, the only chance to do balance for these tasks is during
> the wakeup, however it will be expensive...
>
> Thus the cheaper way is something just like select_idle_sibling(), the only
> difference is now we balance tasks inside the group to prevent them from
> gathered.
>
> Below patch has solved the problem during the testing, I'd like to do more
> testing on other benchmarks before send out the formal patch, any comments
> are welcomed ;-)

So I think that approach is wrong, select_idle_siblings() works because
we want to keep CPUs from being idle, but if they're not actually idle,
pretending like they are (in a cgroup) is actively wrong and can skew
load pretty bad.

Furthermore, if as I expect, dbench sucks on a busy system, then the
proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
alter behaviour like that.

More so, I suspect that patch will tend to overload cpu0 (and lower cpu
numbers in general -- because its scanning in the same direction for
each cgroup) for other workloads. You can't just go pile more and more
work on cpu0 just because there's nothing running in this particular
cgroup.

So dbench is very sensitive to queueing, and select_idle_siblings()
avoids a lot of queueing on an idle system. I don't think that's
something we should fix with cgroups.



2014-06-11 06:13:57

by Michael wang

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

Hi, Peter

Thanks for the reply :)

On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
[snip]
>> Wake-affine for sure pull tasks together for workload like dbench, what make
>> it difference when put dbench into a group one level deeper is the
>> load-balance, which happened less.
>
> We load-balance less (frequently) or we migrate less tasks due to
> load-balancing ?

IMHO, when we put the tasks one group deeper, in other words when the
total weight of these tasks becomes 1024 (previously 3072), the load at
the root level becomes more balanced, which makes the lb-routine consider
the system balanced, which makes us migrate less in the lb-routine.

>
>> Usually, when system is busy, during the wakeup when we could not locate
>> idle cpu, we pick the search point instead, whatever how busy it is since
>> we count on the balance routine later to help balance the load.
>
> But above you said that dbench usually triggers the wake-affine logic,
> but now you say it doesn't and we rely on select_idle_sibling?

During wakeup it triggers wake-affine; after that we go into
select_idle_sibling(), find no idle cpu, and then pick the search point
instead (the curr cpu if wake-affine won, the prev cpu if not).

>
> Note that the comparison isn't fair, running dbench on an idle system vs
> running dbench on a busy system is the first step.

Our comparison is based on the same busy system; both cases have the
same workload running, the only difference is that we put the same
workload (dbench + stress) one group level deeper, like this:

Good case:

    root
     |- l1-A : dbench
     |- l1-B : stress
     `- l1-C : stress

results:
    dbench got around 300%
    each stress got around 450%

Bad case:

    root
     `- l1
         |- l2-A : dbench
         |- l2-B : stress
         `- l2-C : stress

results:
    dbench got around 100% (throughput dropped too)
    each stress got around 550%

Although the l1 group gains the same resources (1200%), they are not
assigned to l2-A/B/C correctly the way the root group did it.
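For reference, this is roughly how the bad-case hierarchy could be set
up (a minimal sketch assuming the v1 cpu controller is mounted at
/sys/fs/cgroup/cpu; the paths and the default 1024 shares are just for
illustration, not taken from the real test scripts):

/* Create /sys/fs/cgroup/cpu/l1/{A,B,C}, each with the default 1024
 * cpu.shares, mirroring the bad case above.  Error handling is left
 * out to keep the sketch short. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

#define CPU_CG "/sys/fs/cgroup/cpu"

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        const char *groups[] = { "A", "B", "C" };
        char path[256];
        int i;

        mkdir(CPU_CG "/l1", 0755);
        write_str(CPU_CG "/l1/cpu.shares", "1024");

        for (i = 0; i < 3; i++) {
                snprintf(path, sizeof(path), CPU_CG "/l1/%s", groups[i]);
                mkdir(path, 0755);
                snprintf(path, sizeof(path), CPU_CG "/l1/%s/cpu.shares",
                         groups[i]);
                write_str(path, "1024");
        }

        /* dbench's pids would then go into .../l1/A/tasks, and the
         * stress pids into .../l1/B/tasks and .../l1/C/tasks. */
        return 0;
}

The good case is the same without the extra l1 level.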

>
> The second is adding the cgroup crap on.
>
>> However, in our cases the load balance could not help on that, since deeper
>> the group is, less the load effect it means to root group.
>
> But since all actual load is on the same depth, the relative threshold
> (imbalance pct) should work the same, the size of the values don't
> matter, the relative ratios do.

Exactly; however, when the group is deep, its chance of making the root
imbalanced is reduced. In the good case, gathering on a cpu means 1024
load, while in the bad case it drops to 1024/3 ideally, which makes it
harder to trigger an imbalance and gain help from the routine. Please
note that although dbench and stress are the only workload in the system,
there are still other tasks serving the system that need to be woken up
(some very actively because of the dbench...); compared to them, a deep
group's load means nothing...

>
>> By which means even tasks in deep group all gathered on one CPU, the load
>> could still balanced from the view of root group, and the tasks lost the
>> only chances (balance) to spread when they already on the same CPU...
>
> Sure, but see above.

The lb-routine cannot provide enough help for a deep group, since an
imbalance inside the group does not cause an imbalance in root: ideally
each l2 task contributes 1024/18 ~= 56 root load, which is easily
ignored, while inside the l2 group the gathered case already means an
imbalance like (1024 * 5) : 1024.

>
>> Furthermore, for tasks flip frequently like dbench, it'll become far more
>> harder for load balance to help, it could even rarely catch them on rq.
>
> And I suspect that is the main problem; so see what it does on a busy
> system: !cgroup: nr_cpus busy loops + dbench, because that's your
> benchmark for adding cgroups, the cgroup can only shift that behaviour
> around.

There are busy loops in the good case too, and dbench's behaviour in the
l1 groups should not change after moving them into l2 groups; what makes
things worse is that the chance for them to spread after gathering
becomes smaller.

>
[snip]
>> Below patch has solved the problem during the testing, I'd like to do more
>> testing on other benchmarks before send out the formal patch, any comments
>> are welcomed ;-)
>
> So I think that approach is wrong, select_idle_siblings() works because
> we want to keep CPUs from being idle, but if they're not actually idle,
> pretending like they are (in a cgroup) is actively wrong and can skew
> load pretty bad.

We only choose the timing when no idle cpu can be located, the wakee
flips are somewhat high, and the group is deep.

In such cases select_idle_sibling() doesn't work anyway; it returns the
target however busy it is. We just check twice to prevent it from making
some obviously bad decisions ;-)

>
> Furthermore, if as I expect, dbench sucks on a busy system, then the
> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> alter behaviour like that.

That's true, and that's why we currently still need to turn off the
GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to solve
later...

What we currently expect is that the cgroup assigns the resources
according to the shares; it works well for the l1 groups, so we expect it
to work equally well for the l2 groups...

>
> More so, I suspect that patch will tend to overload cpu0 (and lower cpu
> numbers in general -- because its scanning in the same direction for
> each cgroup) for other workloads. You can't just go pile more and more
> work on cpu0 just because there's nothing running in this particular
> cgroup.

That's a good point...

However, during testing this didn't happen across the 3 groups; tasks
stayed on the high cpus as often as on the low cpus. IMHO the key point
here is that the lb-routine still works, although much less than before.

So the fix just makes the result of the lb-routine last longer: the
higher cpu it picked is usually idle inside the group (and so gets picked
directly later); in other words, tasks on a high cpu are harder to
wake-affine back to a low cpu than before.

And when this applies to all the groups, each of them will be balanced
both internally and externally, and then we will see an equal number of
tasks on each cpu.

select_idle_sibling() does pick the low cpus more often, and combined
with wake-affine, without enough load balancing the tasks will gather on
the low cpus more often; but our solution makes the rarer load balancing
more valuable (when it is needed). IMHO it could even contribute to the
balancing work in some cases...

>
> So dbench is very sensitive to queueing, and select_idle_siblings()
> avoids a lot of queueing on an idle system. I don't think that's
> something we should fix with cgroups.

They have to queue anyway after wakeup, don't they? We just want a good
candidate which won't make things too bad inside the group, and we only
do this when select_idle_sibling() gives up searching...

Regards,
Michael Wang

>

2014-06-11 08:24:47

by Peter Zijlstra

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
>
> Thanks for the reply :)
>
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine for sure pull tasks together for workload like dbench, what make
> >> it difference when put dbench into a group one level deeper is the
> >> load-balance, which happened less.
> >
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
>
> IMHO, when we put tasks one group deeper, in other word the totally
> weight of these tasks is 1024 (prev is 3072), the load become more
> balancing in root, which make bl-routine consider the system is
> balanced, which make we migrate less in lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy-system, all the two cases have
> the same workload running, the only difference is that we put the same
> workload (dbench + stress) one group deeper, it's like:
>
> Good case:
>
>     root
>      |- l1-A : dbench
>      |- l1-B : stress
>      `- l1-C : stress
>
> results:
>     dbench got around 300%
>     each stress got around 450%
>
> Bad case:
>
>     root
>      `- l1
>          |- l2-A : dbench
>          |- l2-B : stress
>          `- l2-C : stress
>
> results:
>     dbench got around 100% (throughput dropped too)
>     each stress got around 550%
>
> Although the l1-group gain the same resources (1200%), it doesn't assign
> to l2-ABC correctly like the root-group did.

But in this case select_idle_sibling() should function identically, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> >
> >> However, in our cases the load balance could not help on that, since deeper
> >> the group is, less the load effect it means to root group.
> >
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
>
> Exactly, however, when group is deep, the chance of it to make root
> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> imbalance and gain help from the routine, please note that although
> dbench and stress are the only workload in system, there are still other
> tasks serve for the system need to be wakeup (some very actively since
> the dbench...), compared to them, deep group load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even tasks in deep group all gathered on one CPU, the load
> >> could still balanced from the view of root group, and the tasks lost the
> >> only chances (balance) to spread when they already on the same CPU...
> >
> > Sure, but see above.
>
> The lb-routine could not provide enough help for deep group, since the
> imbalance happened inside the group could not cause imbalance in root,
> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
> easily ignored, but inside the l2-group, the gathered case could already
> means imbalance like (1024 * 5) : 1024.

your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_siblings() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
>
> We only choose the timing when no idle cpu located, and flips is
> somewhat high, also the group is deep.

-enotmakingsense

> In such cases, select_idle_siblings() doesn't works anyway, it return
> the target even it is very busy, we just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
>
> That's true and that's why we currently still need to shut down the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assign the resource
> according to the share, it works well in l1-groups, so we expect it to
> work the same well in l2-groups...

Sure, but explain why it isn't? So far you're just saying words that
don't compute.



2014-06-11 09:18:44

by Michael wang

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
[snip]
>>
>> IMHO, when we put tasks one group deeper, in other word the totally
>> weight of these tasks is 1024 (prev is 3072), the load become more
>> balancing in root, which make bl-routine consider the system is
>> balanced, which make we migrate less in lb-routine.
>
> But how? The absolute value (1024 vs 3072) is of no effect to the
> imbalance, the imbalance is computed from relative differences between
> cpus.

OK, forgive me for the confusion; please allow me to explain things
again. For a gathered case like:

cpu 0           cpu 1

dbench          task_sys
dbench          task_sys
dbench
dbench
dbench
dbench
task_sys
task_sys

task_sys stands for other tasks that belong to root and are nice 0, so
when dbench is in l1:

        cpu 0                 cpu 1
load    1024 + 1024*2         1024*2

3072 : 2048    imbalance 150%

now when they belong to l2:

        cpu 0                 cpu 1
load    1024/3 + 1024*2       1024*2

2389 : 2048    imbalance 116%

And it could be even less during my testing...

This is just to explain that when 'group_load : rq_load' becomes lower,
its influence on 'rq_load' becomes lower too, and if the system is
balanced considering only 'rq_load', it will still be considered balanced
even when 'group_load' is gathered on one cpu.

Please let me know if I missed something here...
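
To make the arithmetic above concrete, here is a tiny user-space
calculation (the 125% threshold is only an assumed busy-level
imbalance_pct, not a value taken from the test box):

/* Root-view load on cpu0 vs cpu1 for the gathered dbench group plus
 * two nice-0 system tasks on each cpu, in the l1 and l2 cases above. */
#include <stdio.h>

int main(void)
{
        double task_sys = 1024.0 * 2;   /* two nice-0 tasks per cpu */
        double l1_dbench = 1024.0;      /* whole l1 group sits on cpu0 */
        double l2_dbench = 1024.0 / 3;  /* one of l1's three child groups */
        double cpu1 = task_sys;
        double imb_pct = 125.0;         /* assumed busy imbalance_pct */

        double l1 = 100.0 * (l1_dbench + task_sys) / cpu1;
        double l2 = 100.0 * (l2_dbench + task_sys) / cpu1;

        printf("l1 case: cpu0/cpu1 = %.1f%% %s\n", l1,
               l1 > imb_pct ? "(imbalanced, lb helps)" : "(looks balanced)");
        printf("l2 case: cpu0/cpu1 = %.1f%% %s\n", l2,
               l2 > imb_pct ? "(imbalanced, lb helps)" : "(looks balanced)");
        return 0;
}

So with the same gathering, the l1 case crosses a 125% threshold while
the l2 case does not, and the lb-routine stays quiet.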

>
[snip]
>>
>> Although the l1-group gain the same resources (1200%), it doesn't assign
>> to l2-ABC correctly like the root-group did.
>
> But in this case select_idle_sibling() should function identially, so
> that cannot be the problem.

Yes, that's clear; select_idle_sibling() just returns the curr or prev
cpu in this case.

>
[snip]
>>
>> Exactly, however, when group is deep, the chance of it to make root
>> imbalance reduced, in good case, gathered on cpu means 1024 load, while
>> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
>> imbalance and gain help from the routine, please note that although
>> dbench and stress are the only workload in system, there are still other
>> tasks serve for the system need to be wakeup (some very actively since
>> the dbench...), compared to them, deep group load means nothing...
>
> What tasks are these? And is it their interference that disturbs
> load-balancing?

These are the dbench and stress tasks, which carry less root load when
put into the l2 groups; that makes it harder to trigger a root-group
imbalance, as in the case above.

>
>>>> By which means even tasks in deep group all gathered on one CPU, the load
>>>> could still balanced from the view of root group, and the tasks lost the
>>>> only chances (balance) to spread when they already on the same CPU...
>>>
>>> Sure, but see above.
>>
>> The lb-routine could not provide enough help for deep group, since the
>> imbalance happened inside the group could not cause imbalance in root,
>> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
>> easily ignored, but inside the l2-group, the gathered case could already
>> means imbalance like (1024 * 5) : 1024.
>
> your explanation is not making sense, we have 3 cgroups, so the total
> root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

I mean the l2-groups case here... since l1's share is 1024, the total
load of the l2 groups will be 1024 in theory.

>
> And again, the absolute value doesn't matter, with (istr) 12 cpus the
> avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
> scale.
>
> Same with l2, total weight of 1024, giving a per task weight of ~56 and
> a per-cpu weight of ~85, which is again significant.

We have other tasks which have to run in the system in order to serve
dbench and the rest, and that is also the case in the real world; dbench
and stress are not the only tasks on the rq from time to time.

Maybe we could focus on the case above first and see if it makes things
clearer?

Regards,
Michael Wang

>
> Also, you said load-balance doesn't usually participate much because
> dbench is too fast, so please make up your mind, does it or doesn't it
> matter?
>
>>> So I think that approach is wrong, select_idle_siblings() works because
>>> we want to keep CPUs from being idle, but if they're not actually idle,
>>> pretending like they are (in a cgroup) is actively wrong and can skew
>>> load pretty bad.
>>
>> We only choose the timing when no idle cpu located, and flips is
>> somewhat high, also the group is deep.
>
> -enotmakingsense
>
>> In such cases, select_idle_siblings() doesn't works anyway, it return
>> the target even it is very busy, we just check twice to prevent it from
>> making some obviously bad decision ;-)
>
> -emakinglesssense
>
>>> Furthermore, if as I expect, dbench sucks on a busy system, then the
>>> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
>>> alter behaviour like that.
>>
>> That's true and that's why we currently still need to shut down the
>> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
>> solve later...
>
> more confusion..
>
>> What we currently expect is that the cgroup assign the resource
>> according to the share, it works well in l1-groups, so we expect it to
>> work the same well in l2-groups...
>
> Sure, but explain why it isn't? So far you're just saying words that
> don't compute.
>

2014-06-23 09:42:34

by Peter Zijlstra

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On Wed, Jun 11, 2014 at 05:18:29PM +0800, Michael wang wrote:
> On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> IMHO, when we put tasks one group deeper, in other word the totally
> >> weight of these tasks is 1024 (prev is 3072), the load become more
> >> balancing in root, which make bl-routine consider the system is
> >> balanced, which make we migrate less in lb-routine.
> >
> > But how? The absolute value (1024 vs 3072) is of no effect to the
> > imbalance, the imbalance is computed from relative differences between
> > cpus.
>
> Ok, forgive me for the confusion, please allow me to explain things
> again, for gathered cases like:
>
> cpu 0           cpu 1
>
> dbench          task_sys
> dbench          task_sys
> dbench
> dbench
> dbench
> dbench
> task_sys
> task_sys

It might help if you prefix each task with the cgroup they're in; but I
think I get it, it's like:

cpu0

A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
/task_sys
/task_sys

> task_sys is other tasks belong to root which is nice 0, so when dbench
> in l1:
>
>         cpu 0                 cpu 1
> load    1024 + 1024*2         1024*2
>
> 3072 : 2048    imbalance 150%
>
> now when they belong to l2:

That would be:

cpu0

A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
/task_sys
/task_sys

Right?

>         cpu 0                 cpu 1
> load    1024/3 + 1024*2       1024*2
>
> 2389 : 2048    imbalance 116%

Which should still end up with 3072, because A is still 1024 in total,
and all its member tasks run on the one CPU.

> And it could be even less during my testing...

Well, yes, up to 1024/nr_cpus I imagine.

> This is just try to explain that when 'group_load : rq_load' become
> lower, it's influence to 'rq_load' become lower too, and if the system
> is balanced with only 'rq_load' there, it will be considered still
> balanced even 'group_load' gathered on one cpu.
>
> Please let me know if I missed something here...

Yeah, what other tasks are these task_sys things? workqueue crap?

> >> Exactly, however, when group is deep, the chance of it to make root
> >> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> >> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> >> imbalance and gain help from the routine, please note that although
> >> dbench and stress are the only workload in system, there are still other
> >> tasks serve for the system need to be wakeup (some very actively since
> >> the dbench...), compared to them, deep group load means nothing...
> >
> > What tasks are these? And is it their interference that disturbs
> > load-balancing?
>
> These are dbench and stress with less root-load when put into l2-groups,
> that make it harder to trigger root-group imbalance like in the case above.

You're still not making sense here.. without the task_sys thingies in,
you get something like:

cpu0            cpu1

A/dbench        A/dbench
B/stress        B/stress

And the total loads are: 512+512 vs 512+512.

> > Same with l2, total weight of 1024, giving a per task weight of ~56 and
> > a per-cpu weight of ~85, which is again significant.
>
> We have other tasks which has to running in the system, in order to
> serve dbench and others, and that also the case in real world, dbench
> and stress are not the only tasks on rq time to time.
>
> May be we could focus on the case above and see if it could make things
> more clear firstly?

Well, this all smells like you need some cgroup affinity for whatever
system tasks are running. Not fuck up the scheduler for no sane reason.

2014-06-24 03:10:33

by Michael wang

Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

Hi, Peter

Thanks for the reply :)

On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
[snip]
>>
>> cpu 0           cpu 1
>>
>> dbench          task_sys
>> dbench          task_sys
>> dbench
>> dbench
>> dbench
>> dbench
>> task_sys
>> task_sys
>
> It might help if you prefix each task with the cgroup they're in;

My bad...

> but I
> think I get it, its like:
>
> cpu0
>
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> /task_sys
> /task_sys

Yeah, it's like that.

>
[snip]
>
> cpu0
>
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> /task_sys
> /task_sys
>
> Right?

My bad, I missed the group symbols here... it's actually like:

cpu0

/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/task_sys
/task_sys

And we also have six:

/l1/B/stress

and six:

/l1/C/stress

running in system.

A, B and C are the child groups of l1.

>
>>         cpu 0                 cpu 1
>> load    1024/3 + 1024*2       1024*2
>>
>> 2389 : 2048    imbalance 116%
>
> Which should still end up with 3072, because A is still 1024 in total,
> and all its member tasks run on the one CPU.

l1 has 3 child groups, each with 6 nice-0 tasks, so ideally each task
gets 1024/18 of the root load; the 6 dbench tasks then mean
(1024/18)*6 == 1024/3.

Previously each of the 3 groups had 1024 shares of its own; now they
have to split l1's 1024 shares, so each of them ends up with less.
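
Spelled out with the same numbers:

    l1 shares (at root level)        : 1024
    per child group (A, B, C)        : 1024 / 3 ~= 341
    per task (6 nice-0 tasks each)   : 341 / 6  ~= 57
    the six dbench tasks together    : 6 * 57   ~= 1024 / 3

versus 1024 for the whole dbench group in the good case.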

>
>> And it could be even less during my testing...
>
> Well, yes, up to 1024/nr_cpus I imagine.
>
>> This is just try to explain that when 'group_load : rq_load' become
>> lower, it's influence to 'rq_load' become lower too, and if the system
>> is balanced with only 'rq_load' there, it will be considered still
>> balanced even 'group_load' gathered on one cpu.
>>
>> Please let me know if I missed something here...
>
> Yeah, what other tasks are these task_sys things? workqueue crap?

There are some other tasks, but the ones that mostly show up are the
kworkers, yes, the workqueue stuff.

They show up rapidly on each CPU; in some periods, when they show up too
much, they eat some CPU% too, but not very much.

>
[snip]
>>
>> These are dbench and stress with less root-load when put into l2-groups,
>> that make it harder to trigger root-group imbalance like in the case above.
>
> You're still not making sense here.. without the task_sys thingies in
> you get something like:
>
> cpu0 cpu1
>
> A/dbench A/dbench
> B/stress B/stress
>
> And the total loads are: 512+512 vs 512+512.

Without other tasks' influence I believe the balance would be fine, but
in our case at least these kworkers join the battle anyway...

>
>>> Same with l2, total weight of 1024, giving a per task weight of ~56 and
>>> a per-cpu weight of ~85, which is again significant.
>>
>> We have other tasks which has to running in the system, in order to
>> serve dbench and others, and that also the case in real world, dbench
>> and stress are not the only tasks on rq time to time.
>>
>> May be we could focus on the case above and see if it could make things
>> more clear firstly?
>
> Well, this all smells like you need some cgroup affinity for whatever
> system tasks are running. Not fuck up the scheduler for no sane reason.

These kworkers are already bound to their CPUs; I don't know how to
handle them to prevent the issue. They just keep working on their CPU,
and whenever they show up, dbench stops spreading...

We just want some way to help a workload like dbench work normally with
the cpu cgroup when a stress-like workload is running in the system.

We want dbench to gain more CPU%, but cpu.shares doesn't work as
expected... dbench can get no more than 100% no matter how big its
group's shares are, and we consider the cpu cgroup broken in this case...

I agree that this is not a generic requirement and the scheduler should
only be responsible for the general situation, but since it's really too
big a regression, could we at least provide some way to stop the damage?
After all, most of the cpu-cgroup logic is inside the scheduler...

I'd like to list some real numbers in the patch thread; we really need
some way to make the cpu cgroup perform normally with a workload like
dbench. Actually we found some transaction workloads suffering from this
issue too; in such cases the cpu cgroup just fails at managing the CPU
resources...

Regards,
Michael Wang

>