2014-06-18 04:50:30

by Michael wang

[permalink] [raw]
Subject: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

During testing we found that after putting a benchmark (dbench) into a deep
cpu-group, its tasks (the dbench processes) start to gather on one CPU, so the
benchmark can only get around 100% CPU no matter how big its task-group's
share is. Here is a link describing how to reproduce the issue:

https://lkml.org/lkml/2014/5/16/4

Please note that our comparison was based on the same workload; the only
difference is that we put the workload one level deeper. dbench then got only
1/3 of the CPU% it used to have, and the throughput dropped to half.

dbench got less CPU because all of its instances started gathering on the
same CPU more often than before, and in that situation, no matter how big
their share is, they can only occupy one CPU.

This is caused by the fact that when dbench is in a deep group, the balance
between its gathering speed (which depends on wake-affine) and its spreading
speed (which depends on load-balance) is broken: there are more chances to
gather and fewer chances to spread.

After putting dbench into a deep group, its representative load in the root
group becomes smaller, which makes it harder to break the system's load
balance. Here is a comparison between the dbench root-load and the load of
the other system tasks (besides dbench), for example:

        sg0                             sg1
cpu0            cpu1            cpu2            cpu3

kworker/0:0     kworker/1:0     kworker/2:0     kworker/3:0
kworker/0:1     kworker/1:1     kworker/2:1     kworker/3:1
dbench
dbench
dbench
dbench
dbench
dbench

Without dbench, the load between the two sched-groups is already balanced:

4096:4096

When dbench is in one of the three cpu-cgroups on level 1, each dbench
instance's root-load is 1024/6, so we have:

sg0
4096 + 6 * (1024 / 6)
sg1
4096

sg0 : sg1 == 5120 : 4096 == 125%

This is bigger than imbalance-pct (e.g. 117%), so dbench spreads to sg1.


When dbench is in one of the three cpu-cgroups on level 2, each dbench
instance's root-load is 1024/18, so now we have:

sg0
4096 + 6 * (1024 / 18)
sg1
4096

sg0 : sg1 ~= 4437 : 4096 ~= 108%

This is smaller than imbalance-pct (the same 117%), so dbench keeps gathering in sg0.
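
As a sanity check, the arithmetic above can be redone with a tiny stand-alone
program (ordinary userspace C, not kernel code; the 117 threshold is just the
example imbalance-pct used in this description):

#include <stdio.h>

int main(void)
{
	const unsigned long sg1 = 4096;			/* kworkers only */
	const unsigned int imbalance_pct = 117;		/* example threshold */
	const unsigned int divisor[] = { 6, 18 };	/* per-instance divisor at depth 1 / 2 */

	for (int depth = 1; depth <= 2; depth++) {
		unsigned long sg0 = 4096 + 6 * 1024 / divisor[depth - 1];
		unsigned long pct = sg0 * 100 / sg1;

		printf("depth %d: sg0=%lu sg1=%lu ratio=%lu%% -> %s\n",
		       depth, sg0, sg1, pct,
		       pct > imbalance_pct ? "spread" : "keep gathering");
	}

	return 0;
}

This prints 125% (spread) for the level-1 case and 108% (keep gathering) for
the level-2 case, matching the numbers above.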

Thus the load-balance routine becomes less active in spreading dbench to other
CPUs, and the dbench instances keep gathering on one CPU for longer than before.

This patch tries to select an 'idle' cfs_rq inside the task's cpu-group when no
idle CPU can be located by select_idle_sibling(), instead of returning the
'target' arbitrarily. This recheck helps us preserve the effect of load-balance
longer and makes the system more balanced.

As in the example above, the fix makes things work as follows:
1. dbench instances will be 'balanced' inside the tg; ideally each cpu will
   have one instance.
2. if 1 does make the load imbalanced, the load-balance routine will do
   its job and move instances to the proper CPU.
3. after 2 is done, the target CPU will always be preferred as long as
   it has only one instance.

Although for tasks like dbench, 2 rarely happens; combined with 3, we will
finally locate a good CPU for each instance, which keeps things balanced both
internally and externally.

After applying this patch, the behaviour of dbench in a deep cpu-group becomes
normal and the dbench throughput is back.

We tested benchmarks like ebizzy, kbench and dbench on an x86 12-CPU server;
the patch works well and no regressions showed up.

Highlight:
Without a fix, any workload similar to dbench will face the same
issue: the cpu-cgroup share loses its effect.

This may not just be a cgroup issue; whenever we have small-load tasks
that quickly flip-wake each other, they may gather.

Please let me know if you have any questions about either the issue or the fix,
comments are welcome ;-)

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..e1381cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
return idlest;
}

+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+ return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try to locate an idle CPU in the sched_domain from the tg's view.
+ *
+ * Although being gathered on one CPU or spread across CPUs makes no
+ * difference from the highest group's point of view, gathering will
+ * starve the tasks: even if they have enough share to fight for CPU,
+ * they only get one battlefield, which means no matter how big their
+ * weight is, they get one CPU at most.
+ *
+ * Thus when the system is busy, we filter out those tasks which can't
+ * get help from the balance routine, and try to balance them internally
+ * with this function, so they stand a chance to show their power.
+ *
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+ int i = task_cpu(p);
+ struct task_group *tg = task_group(p);
+
+ if (tg_idle_cpu(tg, target))
+ goto done;
+
+ sd = rcu_dereference(per_cpu(sd_llc, target));
+ for_each_lower_domain(sd) {
+ sg = sd->groups;
+ do {
+ if (!cpumask_intersects(sched_group_cpus(sg),
+ tsk_cpus_allowed(p)))
+ goto next;
+
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ if (i == target || !tg_idle_cpu(tg, i))
+ goto next;
+ }
+
+ target = cpumask_first_and(sched_group_cpus(sg),
+ tsk_cpus_allowed(p));
+
+ goto done;
+next:
+ sg = sg->next;
+ } while (sg != sd->groups);
+ }
+
+done:
+
+ return target;
+}
+
/*
* Try and locate an idle CPU in the sched_domain.
*/
@@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
struct sched_domain *sd;
struct sched_group *sg;
int i = task_cpu(p);
+ struct sched_entity *se = task_group(p)->se[i];

if (idle_cpu(target))
return target;
@@ -4451,6 +4508,30 @@ next:
} while (sg != sd->groups);
}
done:
+
+ if (!idle_cpu(target)) {
+ /*
+ * Finding no idle cpu implies the system is somewhat busy;
+ * usually we count on the load balance routine's help and
+ * just pick the target no matter how busy it is.
+ *
+ * However, when a task belongs to a deep group (harder to
+ * make the root imbalanced) and flips frequently (harder to
+ * be caught during balance), the load balance routine can't
+ * help at all, and these tasks will eventually gather on the
+ * same cpu as they wake each other up; that is, the chance
+ * of gathering is far higher than the chance of spreading.
+ *
+ * Thus for such tasks, we need to handle them carefully
+ * during wakeup, since it is their very rare chance
+ * to spread.
+ *
+ */
+ if (se && se->depth &&
+ p->wakee_flips > this_cpu_read(sd_llc_size))
+ return tg_idle_sibling(p, target);
+ }
+
return target;
}

--
1.7.9.5


2014-06-23 09:42:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Wed, Jun 18, 2014 at 12:50:17PM +0800, Michael wang wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..e1381cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static inline int tg_idle_cpu(struct task_group *tg, int cpu)
> +{
> + return !tg->cfs_rq[cpu]->nr_running;
> +}
> +
> +/*
> + * Try and locate an idle CPU in the sched_domain from tg's view.
> + *
> + * Although gathered on same CPU and spread accross CPUs could make
> + * no difference from highest group's view, this will cause the tasks
> + * starving, even they have enough share to fight for CPU, they only
> + * got one battle filed, which means whatever how big their weight is,
> + * they totally got one CPU at maximum.
> + *
> + * Thus when system is busy, we filtered out those tasks which couldn't
> + * gain help from balance routine, and try to balance them internally
> + * by this func, so they could stand a chance to show their power.
> + *
> + */
> +static int tg_idle_sibling(struct task_struct *p, int target)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg;
> + int i = task_cpu(p);
> + struct task_group *tg = task_group(p);
> +
> + if (tg_idle_cpu(tg, target))
> + goto done;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, target));
> + for_each_lower_domain(sd) {
> + sg = sd->groups;
> + do {
> + if (!cpumask_intersects(sched_group_cpus(sg),
> + tsk_cpus_allowed(p)))
> + goto next;
> +
> + for_each_cpu(i, sched_group_cpus(sg)) {
> + if (i == target || !tg_idle_cpu(tg, i))
> + goto next;
> + }
> +
> + target = cpumask_first_and(sched_group_cpus(sg),
> + tsk_cpus_allowed(p));
> +
> + goto done;
> +next:
> + sg = sg->next;
> + } while (sg != sd->groups);
> + }
> +
> +done:
> +
> + return target;
> +}

Still completely hate this, it doesn't make conceptual sense whatsoever.

2014-06-24 03:35:08

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
[snip]
>> +}
>
> Still completely hate this, it doesn't make sense conceptual sense what
> so ever.

Yeah... and after all the testing these days, I now agree with your opinion
that this could not address all the cases...

Just wondering could we make this another scheduler feature?

I mean, logically this will make tasks spread across the CPUs inside a
task-group while still following the load-balance decision, and the testing
shows that the patch achieves that goal well.

Currently the scheduler doesn't provide a good way to achieve that, correct?

And it does help a lot in our testing for workloads like dbench and
transaction workloads when they are fighting with stress-like workloads;
combined with GENTLE_FAIR_SLEEPERS, we could make the cpu shares work again.
Here are some real numbers of 'dbench 6 -t 60' from our testing:

Without the patch:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 1281241 0.036 62.872
Close 941274 0.002 13.298
Rename 54249 0.120 19.340
Unlink 258686 0.156 37.155
Deltree 36 8.514 41.904
Mkdir 18 0.003 0.003
Qpathinfo 1161327 0.016 40.130
Qfileinfo 203648 0.001 7.118
Qfsinfo 212896 0.004 11.084
Sfileinfo 104385 0.067 55.990
Find 448958 0.033 23.150
WriteX 639464 0.069 55.452
ReadX 2008086 0.009 24.466
LockX 4174 0.012 14.127
UnlockX 4174 0.006 7.357
Flush 89787 1.533 56.925

Throughput 666.318 MB/sec 6 clients 6 procs max_latency=62.875 ms


With the patch applied:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 2601876 0.025 52.339
Close 1911248 0.001 0.133
Rename 110195 0.080 6.739
Unlink 525476 0.070 52.359
Deltree 62 6.143 19.919
Mkdir 31 0.003 0.003
Qpathinfo 2358482 0.009 52.355
Qfileinfo 413190 0.001 0.092
Qfsinfo 432513 0.003 0.790
Sfileinfo 211934 0.027 13.830
Find 911874 0.021 5.969
WriteX 1296646 0.038 52.348
ReadX 4079453 0.006 52.247
LockX 8476 0.003 0.050
UnlockX 8476 0.001 0.045
Flush 182342 0.536 55.953

Throughput 1360.74 MB/sec 6 clients 6 procs max_latency=55.970 ms

And the shares work normally; the CPU% resources are managed well again.

So could we provide a feature like:

SCHED_FEAT(TG_INTERNAL_BALANCE, false)

I do believe there are more cases that could benefit from it; for those who
don't want too much wake-affine and want group tasks more balanced across
the CPUs, the scheduler could provide this as an option then, shall we?
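
To make the proposal concrete, here is a rough sketch of how such a knob
could be wired up, reusing the tg_idle_sibling() helper from the patch above
(illustration only, not a tested submission):

/* kernel/sched/features.h */
SCHED_FEAT(TG_INTERNAL_BALANCE, false)

/* kernel/sched/fair.c, at the end of select_idle_sibling() */
	if (sched_feat(TG_INTERNAL_BALANCE) && !idle_cpu(target) &&
	    se && se->depth && p->wakee_flips > this_cpu_read(sd_llc_size))
		return tg_idle_sibling(p, target);

With the feature disabled by default, the wakeup path behaves exactly as it
does today; only those who explicitly turn it on would get the tg-internal
spreading.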

Regards,
Michael Wang


2014-06-30 07:36:54

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 06/18/2014 12:50 PM, Michael wang wrote:
> By testing we found that after put benchmark (dbench) in to deep cpu-group,
> tasks (dbench routines) start to gathered on one CPU, which lead to that the
> benchmark could only get around 100% CPU whatever how big it's task-group's
> share is, here is the link of the way to reproduce the issue:

Hi, Peter

We thought that involving too many factors would make things too complicated,
so we are trying to start over and get rid of the concepts of 'deep-group'
and 'GENTLE_FAIR_SLEEPERS' in the idea; hopefully this could make things
easier...

Let's put aside the previous discussions; for now we just want to propose a
cpu-group feature which could help dbench gain enough CPU while stress is
running, in a gentle way which hasn't yet been provided by the current
scheduler.

I'll post a new patch for that later; we're looking forward to your comments
on it :)

Regards,
Michael Wang

>
> https://lkml.org/lkml/2014/5/16/4
>
> Please note that our comparison was based on the same workload, the only
> difference is we put the workload one level deeper, and dbench could only
> got 1/3 of the CPU% it used to have, the throughput dropped to half.
>
> The dbench got less CPU since all it's instances start to gathering on the
> same CPU more often than before, and in such cases, whatever how big their
> share is, only one CPU they could occupy.
>
> This is caused by that when dbench is in deep-group, the balance between
> it's gathering speed (depends on wake-affine) and spreading speed (depends
> on load-balance) was broken, that is more gathering chances while less
> spreading chances.
>
> Since after put dbench into deep group, it's representive load in root-group
> become less, which make it harder to break the load balance of system, this
> is a comparison between dbench root-load and system-tasks (besides dbench)
> load, for eg:
>
> sg0 sg1
> cpu0 cpu1 cpu2 cpu3
>
> kworker/0:0 kworker/1:0 kworker/2:0 kworker/3:0
> kworker/0:1 kworker/1:1 kworker/2:1 kworker/3:1
> dbench
> dbench
> dbench
> dbench
> dbench
> dbench
>
> Here without dbench, the load between sg is already balanced, which is:
>
> 4096:4096
>
> When dbench is in one of the three cpu-cgroups on level 1, it's root-load
> is 1024/6, so we have:
>
> sg0
> 4096 + 6 * (1024 / 6)
> sg1
> 4096
>
> sg0 : sg1 == 5120 : 4096 == 125%
>
> bigger than imbalance-pct (117% for eg), dbench spread to sg1
>
>
> When dbench is in one of the three cpu-cgroups on level 2, it's root-load
> is 1024/18, now we have:
>
> sg0
> 4096 + 6 * (1024 / 18)
> sg1
> 4096
>
> sg0 : sg1 ~= 4437 : 4096 ~= 108%
>
> smaller than imbalance-pct (same the 117%), dbench keep gathering in sg0
>
> Thus load-balance routine become inactively on spreading dbench to other CPU,
> and it's routine keep gathering on CPU more longer than before.
>
> This patch try to select 'idle' cfs_rq inside task's cpu-group when there is no
> idle CPU located by select_idle_sibling(), instead of return the 'target'
> arbitrarily, this recheck help us to reserve the effect of load-balance longer,
> and help to make the system more balance.
>
> Like in the example above, the fix now will make things as:
> 1. dbench instances will be 'balanced' inside tg, ideally each cpu will
> have one instance.
> 2. if 1 do make the load become imbalance, load-balance routine will do
> it's job and move instances to proper CPU.
> 3. after 2 was done, the target CPU will always be preferred as long as
> it only got one instance.
>
> Although for tasks like dbench, 2 is rarely happened, while combined with 3, we
> will finally locate a good CPU for each instance which make both internal and
> external balanced.
>
> After applied this patch, the behaviour of dbench in deep cpu-group become
> normal, the dbench throughput was back.
>
> Tested benchmarks like ebizzy, kbench, dbench on X86 12-CPU server, the patch
> works well and no regression showup.
>
> Highlight:
> With out a fix, any similar workload like dbench will face the same
> issue that the cpu-cgroup share lost it's effect
>
> This may not just be an issue of cgroup, whenever we have tasks which
> with small-load, play quick flip on each other, they may gathering.
>
> Please let me know if you have any questions on whatever the issue or the fix,
> comments are welcomed ;-)
>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 81 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..e1381cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static inline int tg_idle_cpu(struct task_group *tg, int cpu)
> +{
> + return !tg->cfs_rq[cpu]->nr_running;
> +}
> +
> +/*
> + * Try and locate an idle CPU in the sched_domain from tg's view.
> + *
> + * Although gathered on same CPU and spread accross CPUs could make
> + * no difference from highest group's view, this will cause the tasks
> + * starving, even they have enough share to fight for CPU, they only
> + * got one battle filed, which means whatever how big their weight is,
> + * they totally got one CPU at maximum.
> + *
> + * Thus when system is busy, we filtered out those tasks which couldn't
> + * gain help from balance routine, and try to balance them internally
> + * by this func, so they could stand a chance to show their power.
> + *
> + */
> +static int tg_idle_sibling(struct task_struct *p, int target)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg;
> + int i = task_cpu(p);
> + struct task_group *tg = task_group(p);
> +
> + if (tg_idle_cpu(tg, target))
> + goto done;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, target));
> + for_each_lower_domain(sd) {
> + sg = sd->groups;
> + do {
> + if (!cpumask_intersects(sched_group_cpus(sg),
> + tsk_cpus_allowed(p)))
> + goto next;
> +
> + for_each_cpu(i, sched_group_cpus(sg)) {
> + if (i == target || !tg_idle_cpu(tg, i))
> + goto next;
> + }
> +
> + target = cpumask_first_and(sched_group_cpus(sg),
> + tsk_cpus_allowed(p));
> +
> + goto done;
> +next:
> + sg = sg->next;
> + } while (sg != sd->groups);
> + }
> +
> +done:
> +
> + return target;
> +}
> +
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> @@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
> struct sched_domain *sd;
> struct sched_group *sg;
> int i = task_cpu(p);
> + struct sched_entity *se = task_group(p)->se[i];
>
> if (idle_cpu(target))
> return target;
> @@ -4451,6 +4508,30 @@ next:
> } while (sg != sd->groups);
> }
> done:
> +
> + if (!idle_cpu(target)) {
> + /*
> + * No idle cpu located imply the system is somewhat busy,
> + * usually we count on load balance routine's help and
> + * just pick the target whatever how busy it is.
> + *
> + * However, when task belong to a deep group (harder to
> + * make root imbalance) and flip frequently (harder to be
> + * caught during balance), load balance routine could help
> + * nothing, and these tasks will eventually gathered on same
> + * cpu when they wakeup each other, that is the chance of
> + * gathered stand far more higher than the chance of spread.
> + *
> + * Thus for such tasks, we need to handle them carefully
> + * during wakeup, since it's the very rarely chance for
> + * them to spread.
> + *
> + */
> + if (se && se->depth &&
> + p->wakee_flips > this_cpu_read(sd_llc_size))
> + return tg_idle_sibling(p, target);
> + }
> +
> return target;
> }
>
>

2014-06-30 08:06:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Mon, 2014-06-30 at 15:36 +0800, Michael wang wrote:
> On 06/18/2014 12:50 PM, Michael wang wrote:
> > By testing we found that after put benchmark (dbench) in to deep cpu-group,
> > tasks (dbench routines) start to gathered on one CPU, which lead to that the
> > benchmark could only get around 100% CPU whatever how big it's task-group's
> > share is, here is the link of the way to reproduce the issue:
>
> Hi, Peter
>
> We thought that involved too much factors will make things too
> complicated, we are trying to start over and get rid of the concepts of
> 'deep-group' and 'GENTLE_FAIR_SLEEPERS' in the idea, wish this could
> make things more easier...

While you're getting rid of the concept of 'GENTLE_FAIR_SLEEPERS', don't
forget to also get rid of the concept of 'over-scheduling' :)

That gentle thing isn't perfect (is the enemy of good), but preemption
model being based upon sleep, while nice and simple, has the unfortunate
weakness that as contention increases, so does the quantity of sleep in
the system. Would be nice to come up with an alternative preemption
model as dirt simple as this one, but lacking the inherent weakness.

-Mike

2014-06-30 08:47:32

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

Hi, Mike :)

On 06/30/2014 04:06 PM, Mike Galbraith wrote:
> On Mon, 2014-06-30 at 15:36 +0800, Michael wang wrote:
>> On 06/18/2014 12:50 PM, Michael wang wrote:
>>> By testing we found that after put benchmark (dbench) in to deep cpu-group,
>>> tasks (dbench routines) start to gathered on one CPU, which lead to that the
>>> benchmark could only get around 100% CPU whatever how big it's task-group's
>>> share is, here is the link of the way to reproduce the issue:
>>
>> Hi, Peter
>>
>> We thought that involved too much factors will make things too
>> complicated, we are trying to start over and get rid of the concepts of
>> 'deep-group' and 'GENTLE_FAIR_SLEEPERS' in the idea, wish this could
>> make things more easier...
>
> While you're getting rid of the concept of 'GENTLE_FAIR_SLEEPERS', don't
> forget to also get rid of the concept of 'over-scheduling' :)

I'm new to this word... could you give more details on that?

>
> That gentle thing isn't perfect (is the enemy of good), but preemption
> model being based upon sleep, while nice and simple, has the unfortunate
> weakness that as contention increases, so does the quantity of sleep in
> the system. Would be nice to come up with an alternative preemption
> model as dirt simple as this one, but lacking the inherent weakness.

Preemption based on vruntime sounds fair enough, but the vruntime bonus for
the wakee does need some more thinking... although I don't want to bring the
gentle-stuff in any more, disabling it does help dbench a lot...

Regards,
Michael Wang

>
> -Mike
>

2014-06-30 09:27:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Mon, 2014-06-30 at 16:47 +0800, Michael wang wrote:
> Hi, Mike :)
>
> On 06/30/2014 04:06 PM, Mike Galbraith wrote:
> > On Mon, 2014-06-30 at 15:36 +0800, Michael wang wrote:
> >> On 06/18/2014 12:50 PM, Michael wang wrote:
> >>> By testing we found that after put benchmark (dbench) in to deep cpu-group,
> >>> tasks (dbench routines) start to gathered on one CPU, which lead to that the
> >>> benchmark could only get around 100% CPU whatever how big it's task-group's
> >>> share is, here is the link of the way to reproduce the issue:
> >>
> >> Hi, Peter
> >>
> >> We thought that involved too much factors will make things too
> >> complicated, we are trying to start over and get rid of the concepts of
> >> 'deep-group' and 'GENTLE_FAIR_SLEEPERS' in the idea, wish this could
> >> make things more easier...
> >
> > While you're getting rid of the concept of 'GENTLE_FAIR_SLEEPERS', don't
> > forget to also get rid of the concept of 'over-scheduling' :)
>
> I'm new to this word... could you give more details on that?

Massive context switching. When heavily overloaded, wakeup preemption
tends to hurt. Trouble being that when overloaded, that's when
fast/light tasks also need to get in and back out quickly the most.

> > That gentle thing isn't perfect (is the enemy of good), but preemption
> > model being based upon sleep, while nice and simple, has the unfortunate
> > weakness that as contention increases, so does the quantity of sleep in
> > the system. Would be nice to come up with an alternative preemption
> > model as dirt simple as this one, but lacking the inherent weakness.
>
> The preemtion based on vruntime sounds fair enough, but vruntime-bonus
> for wakee do need few more thinking... although I don't want to count
> the gentle-stuff in any more, but disable it do help dbench a lot...

It's scaled, but that's not really enough. Zillion tasks can sleep in
parallel, and when they are doing that, sleep time becomes a rather
meaningless preemption yardstick. It's only meaningful when there is a
significant delta between task behaviors. When running a homogeneous
load of sleepers, eg a zillion java threads all doing the same damn
thing, you're better off turning wakeup preemption off, because trying
to smooth out microscopic vruntime deltas via wakeup preemption then
does nothing but trashes caches.

-Mike

2014-07-01 02:58:05

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 06/30/2014 05:27 PM, Mike Galbraith wrote:
> On Mon, 2014-06-30 at 16:47 +0800, Michael wang wrote:
[snip]
>>> While you're getting rid of the concept of 'GENTLE_FAIR_SLEEPERS', don't
>>> forget to also get rid of the concept of 'over-scheduling' :)
>>
>> I'm new to this word... could you give more details on that?
>
> Massive context switching. When heavily overloaded, wakeup preemption
> tends to hurt. Trouble being that when overloaded, that's when
> fast/light tasks also need to get in and back out quickly the most.

That's true... but for tasks sensitive to latency, the more frequent the
preemption is, the more quickly they get a chance to run; although that's
really a very small slice, sometimes that is all they need... like the mouse
pointer scurrying.

>
[snip]
>> The preemtion based on vruntime sounds fair enough, but vruntime-bonus
>> for wakee do need few more thinking... although I don't want to count
>> the gentle-stuff in any more, but disable it do help dbench a lot...
>
> It's scaled, but that's not really enough. Zillion tasks can sleep in
> parallel, and when they are doing that, sleep time becomes a rather
> meaningless preemption yardstick. It's only meaningful when there is a
> significant delta between task behaviors. When running a homogeneous
> load of sleepers, eg a zillion java threads all doing the same damn
> thing, you're better off turning wakeup preemption off, because trying
> to smooth out microscopic vruntime deltas via wakeup preemption then
> does nothing but trashes caches.

I see; for those who prefer throughput, the effort spent on latency is
meaningless...

IMHO, currently the generic scheduler just tries to take care of both latency
and throughput; each takes a little damage but won't be damaged too much,
they are simply traded off against each other...

Fortunately, we still have various knobs and features to customize our own
scheduler, The Flash or The Hulk ;-)

Regards,
Michael Wang

>
> -Mike
>

2014-07-01 05:41:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Tue, 2014-07-01 at 10:57 +0800, Michael wang wrote:

> IMHO, currently the generic scheduler just try to take care both latency
> and throughput, both will take a little damage but won't be damaged too
> much, they just sacrificed for each other...

The cost can be large, so it's worth an occasional ponder.

-Mike

2014-07-01 06:10:17

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/01/2014 01:41 PM, Mike Galbraith wrote:
> On Tue, 2014-07-01 at 10:57 +0800, Michael wang wrote:
>
>> IMHO, currently the generic scheduler just try to take care both latency
>> and throughput, both will take a little damage but won't be damaged too
>> much, they just sacrificed for each other...
>
> The cost can be large, so it's worth an occasional ponder.

Sure, if there are some useless efforts there, we'll locate them anyway,
sooner or later ;-)

Regards,
Michael Wang

>
> -Mike
>

2014-07-01 08:20:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Tue, Jun 24, 2014 at 11:34:54AM +0800, Michael wang wrote:
> On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
> [snip]
> >> +}
> >
> > Still completely hate this, it doesn't make sense conceptual sense what
> > so ever.
>
> Yeah... and now I agree your opinion that this could not address all the
> cases after all the testing these days...
>
> Just wondering could we make this another scheduler feature?

No; sched_feat() is for debugging, BIG CLUE: its guarded by
CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
broken.

If people are using it, I should remove or at least randomize the
interface.



2014-07-01 08:39:12

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/01/2014 04:20 PM, Peter Zijlstra wrote:
[snip]
>>
>> Just wondering could we make this another scheduler feature?
>
> No; sched_feat() is for debugging, BIG CLUE: its guarded by
> CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
> broken.
>
> If people are using it, I should remove or at least randomize the
> interface.

Fair enough... but are there any suggestions on how to handle this issue?

Currently when dbench runs together with stress it can only gain one CPU and
the cpu-cgroup cpu.shares becomes meaningless; are there any good methods to
address that?

Regards,
Michael Wang

>

2014-07-01 08:57:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Tue, Jul 01, 2014 at 04:38:58PM +0800, Michael wang wrote:
> On 07/01/2014 04:20 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> Just wondering could we make this another scheduler feature?
> >
> > No; sched_feat() is for debugging, BIG CLUE: its guarded by
> > CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
> > broken.
> >
> > If people are using it, I should remove or at least randomize the
> > interface.
>
> Fair enough... but is there any suggestions on how to handle this issue?
>
> Currently when dbench running with stress, it could only gain one CPU,
> and cpu-cgroup cpu.shares is meaningless, is there any good methods to
> address that?

So as far as I understood this is because of 'other' tasks; these other
tasks have never been fully qualified, but I suspect they're workqueue
thingies. One solution would be to have work run in the same cgroup as
the task creating it.

The thing is, this is not a scheduler problem, so we should not fix it
in the scheduler.



2014-07-02 02:48:15

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/01/2014 04:56 PM, Peter Zijlstra wrote:
> On Tue, Jul 01, 2014 at 04:38:58PM +0800, Michael wang wrote:
[snip]
>> Currently when dbench running with stress, it could only gain one CPU,
>> and cpu-cgroup cpu.shares is meaningless, is there any good methods to
>> address that?
>
> So as far as I understood this is because of 'other' tasks; these other
> tasks have never been fully qualified, but I suspect they're workqueue
> thingies. One solution would be to have work run in the same cgroup as
> the task creating it.

Yes, they are kworkers.

>
> The thing is, this is not a scheduler problem, so we should not fix it
> in the scheduler.

I won't argue on that point, but even if we get rid of the cgroup stuff and,
say, just run 12 stress instances, then 'dbench 6' will suffer low performance
too; this kind of mixed workload is not treated well enough from dbench's
point of view, and no method provided by the scheduler can address that...

The opinion on features actually makes me a little confused... I used to
think the scheduler was willing to provide various ways to adapt itself to
different situations, and some features do help with that; making them only a
debug option makes the scheduler more static to users, but I know that's not
something I should touch...

Anyway, the problem is still there and it seems we have to adopt some optional
solution to address it; we'll keep working and experimenting on that, please
let us know if you have any ideas we could help with ;-)

Regards,
Michael Wang

>

2014-07-02 12:49:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Wed, Jul 02, 2014 at 10:47:34AM +0800, Michael wang wrote:
> The opinion on features actually make me a little confusing... I used to
> think the scheduler is willing on providing kinds of way to adapt itself
> to different situation, and some features do help on that, make them
> only a debug option make the scheduler more static to users, but I know
> that's not what I should touched...
>
> Anyway, the problem is still there and seems like we have to adopt some
> optional solution to address it, we'll still working and practice on
> that, please let us know if you have any ideas we could help on ;-)

No, its very much not intended as a user knob to adapt.

The problem with that approach to scheduling is that one set of knobs
runs workload A, while another set of knobs runs workload B, but what
happens if you need to run both A and B on the same machine?

So optional things and knobs are a complete fail and will not happen.

Clearly I need to go take out all these things because people don't seem
to know this and SCHED_DEBUG isn't a big enough hint. Tedious.



2014-07-02 13:15:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Wed, Jul 02, 2014 at 02:49:18PM +0200, Peter Zijlstra wrote:
> Clearly I need to go take out all these things because people don't seem
> to know this and SCHED_DEBUG isn't a big enough hint. Tedious.

Maybe this would be enough clue?

---
include/linux/kernel.h | 1 +
include/linux/sched/sysctl.h | 6 +++++
kernel/panic.c | 2 ++
kernel/sched/core.c | 59 +++++++++++++++++++++++++++++++++++---------
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 2 ++
kernel/sysctl.c | 18 +++++++-------
7 files changed, 70 insertions(+), 21 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4c52907a6d8b..e2dd5ca9e6bf 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -470,6 +470,7 @@ extern enum system_states {
#define TAINT_FIRMWARE_WORKAROUND 11
#define TAINT_OOT_MODULE 12
#define TAINT_UNSIGNED_MODULE 13
+#define TAINT_DEBUG 14

extern const char hex_asc[];
#define hex_asc_lo(x) hex_asc[((x) & 0x0f)]
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e007c62..9d8f04bc555d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -60,6 +60,12 @@ extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
extern unsigned int sysctl_sched_shares_window;

+int sched_debug_proc_dointvec_minmax(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
+int sched_debug_proc_doulongvec_minmax(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
diff --git a/kernel/panic.c b/kernel/panic.c
index 62e16cef9cc2..e407bf59546c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -224,6 +224,7 @@ static const struct tnt tnts[] = {
{ TAINT_FIRMWARE_WORKAROUND, 'I', ' ' },
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE, 'E', ' ' },
+ { TAINT_DEBUG, 'G', ' ' },
};

/**
@@ -243,6 +244,7 @@ static const struct tnt tnts[] = {
* 'I' - Working around severe firmware bug.
* 'O' - Out-of-tree module has been loaded.
* 'E' - Unsigned module has been loaded.
+ * 'G' - The user fiddled with nonstandard settings.
*
* The string is overwritten by the next call to print_tainted().
*/
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f11515cf070b..851f6bf6abea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -159,6 +159,36 @@ const_debug unsigned int sysctl_sched_features =
#undef SCHED_FEAT

#ifdef CONFIG_SCHED_DEBUG
+void sched_debug_taint(void)
+{
+ static bool once;
+
+ if (once)
+ return;
+
+ once = true;
+ add_taint(TAINT_DEBUG, true);
+ printk(KERN_WARNING "Tainted kernel 'G' -- poking at debug settings.\n");
+}
+
+int sched_debug_proc_dointvec_minmax(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ if (write)
+ sched_debug_taint();
+
+ return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+}
+
+int sched_debug_proc_doulongvec_minmax(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ if (write)
+ sched_debug_taint();
+
+ return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+}
+
#define SCHED_FEAT(name, enabled) \
#name ,

@@ -246,6 +276,8 @@ sched_feat_write(struct file *filp, const char __user *ubuf,
char *cmp;
int i;

+ sched_debug_taint();
+
if (cnt > 63)
cnt = 63;

@@ -1855,6 +1887,9 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
int err;
int state = numabalancing_enabled;

+ if (write)
+ sched_debug_taint();
+
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;

@@ -4956,31 +4991,31 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
return NULL;

set_table_entry(&table[0], "min_interval", &sd->min_interval,
- sizeof(long), 0644, proc_doulongvec_minmax, false);
+ sizeof(long), 0644, sched_debug_proc_doulongvec_minmax, false);
set_table_entry(&table[1], "max_interval", &sd->max_interval,
- sizeof(long), 0644, proc_doulongvec_minmax, false);
+ sizeof(long), 0644, sched_debug_proc_doulongvec_minmax, false);
set_table_entry(&table[2], "busy_idx", &sd->busy_idx,
- sizeof(int), 0644, proc_dointvec_minmax, true);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, true);
set_table_entry(&table[3], "idle_idx", &sd->idle_idx,
- sizeof(int), 0644, proc_dointvec_minmax, true);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, true);
set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx,
- sizeof(int), 0644, proc_dointvec_minmax, true);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, true);
set_table_entry(&table[5], "wake_idx", &sd->wake_idx,
- sizeof(int), 0644, proc_dointvec_minmax, true);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, true);
set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx,
- sizeof(int), 0644, proc_dointvec_minmax, true);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, true);
set_table_entry(&table[7], "busy_factor", &sd->busy_factor,
- sizeof(int), 0644, proc_dointvec_minmax, false);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, false);
set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
- sizeof(int), 0644, proc_dointvec_minmax, false);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, false);
set_table_entry(&table[9], "cache_nice_tries",
&sd->cache_nice_tries,
- sizeof(int), 0644, proc_dointvec_minmax, false);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, false);
set_table_entry(&table[10], "flags", &sd->flags,
- sizeof(int), 0644, proc_dointvec_minmax, false);
+ sizeof(int), 0644, sched_debug_proc_dointvec_minmax, false);
set_table_entry(&table[11], "max_newidle_lb_cost",
&sd->max_newidle_lb_cost,
- sizeof(long), 0644, proc_doulongvec_minmax, false);
+ sizeof(long), 0644, sched_debug_proc_doulongvec_minmax, false);
set_table_entry(&table[12], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring, false);
/* &table[13] is terminator */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 923fe32db6b3..e09202eb348f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -577,6 +577,9 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
int factor = get_update_sysctl_factor();

+ if (write)
+ sched_debug_taint();
+
if (ret || !write)
return ret;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0191ed563bdd..fa880ca9ee4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -874,6 +874,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
# define const_debug const
#endif

+extern void sched_debug_taint(void);
+
extern const_debug unsigned int sysctl_sched_features;

#define SCHED_FEAT(name, enabled) \
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7de6555cfea0..f9810984f3ca 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -328,35 +328,35 @@ static struct ctl_table kern_table[] = {
.data = &sysctl_sched_migration_cost,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "sched_nr_migrate",
.data = &sysctl_sched_nr_migrate,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "sched_time_avg_ms",
.data = &sysctl_sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "sched_shares_window_ns",
.data = &sysctl_sched_shares_window,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "timer_migration",
.data = &sysctl_timer_migration,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec_minmax,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
.extra1 = &zero,
.extra2 = &one,
},
@@ -367,28 +367,28 @@ static struct ctl_table kern_table[] = {
.data = &sysctl_numa_balancing_scan_delay,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "numa_balancing_scan_period_min_ms",
.data = &sysctl_numa_balancing_scan_period_min,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "numa_balancing_scan_period_max_ms",
.data = &sysctl_numa_balancing_scan_period_max,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "numa_balancing_scan_size_mb",
.data = &sysctl_numa_balancing_scan_size,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sched_debug_proc_dointvec_minmax,
},
{
.procname = "numa_balancing",



2014-07-02 14:48:21

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/01/2014 04:38 AM, Michael wang wrote:
> On 07/01/2014 04:20 PM, Peter Zijlstra wrote:
> [snip]
>>>
>>> Just wondering could we make this another scheduler feature?
>>
>> No; sched_feat() is for debugging, BIG CLUE: its guarded by
>> CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
>> broken.
>>
>> If people are using it, I should remove or at least randomize the
>> interface.
>
> Fair enough... but is there any suggestions on how to handle this issue?
>
> Currently when dbench running with stress, it could only gain one CPU,
> and cpu-cgroup cpu.shares is meaningless, is there any good methods to
> address that?

select_idle_sibling will iterate over all of the CPUs
in an LLC domain if there is no idle cpu in the domain.

I suspect it would not take much extra code to track
down the idlest CPU in the LLC domain, and make sure to
schedule tasks there, in case no completely idle CPU
was found.

Are there any major problems with that thinking?
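
Something along these lines, as a rough sketch only (cpu_load_of() below is a
placeholder for whatever load metric would be used, not an existing helper):

static int select_idle_or_idlest(struct task_struct *p, int target)
{
	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
	struct sched_group *sg;
	unsigned long min_load = ULONG_MAX;
	int idlest = target, i;

	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			for_each_cpu_and(i, sched_group_cpus(sg),
					 tsk_cpus_allowed(p)) {
				if (idle_cpu(i))
					return i;	/* best case: truly idle */

				/* remember the least loaded CPU seen so far */
				if (cpu_load_of(i) < min_load) {
					min_load = cpu_load_of(i);
					idlest = i;
				}
			}
			sg = sg->next;
		} while (sg != sd->groups);
	}

	/* no idle CPU in the LLC domain, fall back to the least loaded one */
	return idlest;
}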

2014-07-03 02:10:06

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/02/2014 08:49 PM, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:47:34AM +0800, Michael wang wrote:
>> The opinion on features actually make me a little confusing... I used to
>> think the scheduler is willing on providing kinds of way to adapt itself
>> to different situation, and some features do help on that, make them
>> only a debug option make the scheduler more static to users, but I know
>> that's not what I should touched...
>>
>> Anyway, the problem is still there and seems like we have to adopt some
>> optional solution to address it, we'll still working and practice on
>> that, please let us know if you have any ideas we could help on ;-)
>
> No, its very much not intended as a user knob to adapt.
>
> The problem with that approach to scheduling is that one set of knobs
> runs workload A, while another set of knobs runs workload B, but what
> happens if you need to run both A and B on the same machine?

That's actually what makes me think the scheduler should be able to adapt
itself; our performance testing is mostly based on static kinds of workloads,
and that's too far away from the real world.

We tested A, we tested B, we tested A + B, but what if C joins? What if
folks only like A and don't care about B?

The key point here is that the scheduler still can't promise that every kind
of situation is taken good care of under its static policy... it just runs
there, gets no feedback from users, and thinks it's doing the right thing,
but sometimes it isn't from the admin's point of view.

>
> So optional things and knobs are a complete fail and will not happen.

But how could we promise that our static policy is the best?

For example, 'sysctl_sched_min_granularity': why should it be 750k? Is that
the best value on every kind of platform, for every kind of workload, at any
time?

Regards,
Michael Wang

>
> Clearly I need to go take out all these things because people don't seem
> to know this and SCHED_DEBUG isn't a big enough hint. Tedious.
>

2014-07-03 02:17:08

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/02/2014 10:47 PM, Rik van Riel wrote:
> On 07/01/2014 04:38 AM, Michael wang wrote:
>> On 07/01/2014 04:20 PM, Peter Zijlstra wrote:
>> [snip]
>>>>
>>>> Just wondering could we make this another scheduler feature?
>>>
>>> No; sched_feat() is for debugging, BIG CLUE: its guarded by
>>> CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
>>> broken.
>>>
>>> If people are using it, I should remove or at least randomize the
>>> interface.
>>
>> Fair enough... but is there any suggestions on how to handle this issue?
>>
>> Currently when dbench running with stress, it could only gain one CPU,
>> and cpu-cgroup cpu.shares is meaningless, is there any good methods to
>> address that?

Hi, Rik

>
> select_idle_sibling will iterate over all of the CPUs
> in an LLC domain if there is no idle cpu in the domain.
>
> I suspect it would not take much extra code to track
> down the idlest CPU in the LLC domain, and make sure to
> schedule tasks there, in case no completely idle CPU
> was found.
>
> Are there any major problems with that thinking?

There were some attempts to improve that part previously, but testing showed
that the existing logic is cheap and good enough...

And in our case this is cheap too: since there is no idle CPU (the 12 stress
instances occupy all the CPUs), select_idle_sibling() will go through the
path very quickly, because whenever there is a non-idle CPU in an sg, that sg
will be abandoned.

Regards,
Michael Wang

>

2014-07-03 03:51:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Wed, 2014-07-02 at 10:47 -0400, Rik van Riel wrote:
> On 07/01/2014 04:38 AM, Michael wang wrote:
> > On 07/01/2014 04:20 PM, Peter Zijlstra wrote:
> > [snip]
> >>>
> >>> Just wondering could we make this another scheduler feature?
> >>
> >> No; sched_feat() is for debugging, BIG CLUE: its guarded by
> >> CONFIG_SCHED_DEBUG, anybody using it in production or anywhere else is
> >> broken.
> >>
> >> If people are using it, I should remove or at least randomize the
> >> interface.
> >
> > Fair enough... but is there any suggestions on how to handle this issue?
> >
> > Currently when dbench running with stress, it could only gain one CPU,
> > and cpu-cgroup cpu.shares is meaningless, is there any good methods to
> > address that?
>
> select_idle_sibling will iterate over all of the CPUs
> in an LLC domain if there is no idle cpu in the domain.
>
> I suspect it would not take much extra code to track
> down the idlest CPU in the LLC domain, and make sure to
> schedule tasks there, in case no completely idle CPU
> was found.
>
> Are there any major problems with that thinking?

That's full wake balance.. if that was cheap, select_idle_sibling()
would not exist.

-Mike

2014-07-03 16:29:17

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On Wed, 2 Jul 2014, Peter Zijlstra wrote:

> On Wed, Jul 02, 2014 at 02:49:18PM +0200, Peter Zijlstra wrote:
> > Clearly I need to go take out all these things because people don't seem
> > to know this and SCHED_DEBUG isn't a big enough hint. Tedious.
>
> Maybe this would be enough clue?
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 4c52907a6d8b..e2dd5ca9e6bf 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -470,6 +470,7 @@ extern enum system_states {
> #define TAINT_FIRMWARE_WORKAROUND 11
> #define TAINT_OOT_MODULE 12
> #define TAINT_UNSIGNED_MODULE 13
> +#define TAINT_DEBUG 14

LOL ;-)


Nicolas

2014-07-11 16:12:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance


On 07/02/2014 11:51 PM, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 10:47 -0400, Rik van Riel wrote:
>> On 07/01/2014 04:38 AM, Michael wang wrote:
>>> On 07/01/2014 04:20 PM, Peter Zijlstra wrote: [snip]
>>>>>
>>>>> Just wondering could we make this another scheduler
>>>>> feature?
>>>>
>>>> No; sched_feat() is for debugging, BIG CLUE: its guarded by
>>>> CONFIG_SCHED_DEBUG, anybody using it in production or
>>>> anywhere else is broken.
>>>>
>>>> If people are using it, I should remove or at least randomize
>>>> the interface.
>>>
>>> Fair enough... but is there any suggestions on how to handle
>>> this issue?
>>>
>>> Currently when dbench running with stress, it could only gain
>>> one CPU, and cpu-cgroup cpu.shares is meaningless, is there any
>>> good methods to address that?
>>
>> select_idle_sibling will iterate over all of the CPUs in an LLC
>> domain if there is no idle cpu in the domain.
>>
>> I suspect it would not take much extra code to track down the
>> idlest CPU in the LLC domain, and make sure to schedule tasks
>> there, in case no completely idle CPU was found.
>>
>> Are there any major problems with that thinking?
>
> That's full wake balance.. if that was cheap,
> select_idle_sibling() would not exist.

Full wake balance iterates over all the groups in the system,
select_idle_sibling only over one LLC domain.

If no CPU in the LLC domain is idle, select_idle_sibling will
iterate over all of them. During this iteration, which the
code already does, it should be possible to identify the least
loaded of the CPUs and pick that one.

--
All rights reversed

2014-07-14 01:29:41

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

On 07/12/2014 12:11 AM, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
[snip]
>>
>> That's full wake balance.. if that was cheap,
>> select_idle_sibling() would not exist.
>
> Full wake balance iterates over all the groups in the system,
> select_idle_sibling only over one LLC domain.

Furthermore, the balance path will calculate loads and do comparisons, which
costs many cycles, while select_idle_sibling() just checks idle status.

>
> If no CPU in the LLC domain is idle, select_idle_sibling will
> iterate over all of them.

Just the first cpu of each group; for example, starting from cpu0:

sd1:
        sg0                 sg1
        cpu0    cpu1        cpu2    cpu3
sd0:
        sg0     sg1
        cpu0    cpu1

The things we'll check are:

sd1:
        idle_cpu(0)     // busy
        idle_cpu(2)     // busy
sd0:
        idle_cpu(0)     // busy

That's really cheap compared with the balance path...

> During this iteration, which the
> code already does, it should be possible to identify the least
> loaded of the CPUs and pick that one.

That would make select_idle_sibling() much closer to the balance path (just
with a lower starting domain); IMHO this doesn't seem like such a good idea...
what we gain isn't worth the overhead.

But if we have testing showing this modification could benefit most workloads
(I don't think so, but who knows...), then we'll have a reason to add some
load-comparison logic inside that quick path ;-)

Regards,
Michael Wang
