Arm DynamiQ systems can integrate cores with different microarchitectures
or max OPPs under the same DSU, so we can have cores with different compute
capacity at the LLC (which was not the case with the legacy big.LITTLE
architecture). Such a configuration is similar in some ways to ITMT on Intel
platforms, which allows some cores to be boosted to a higher turbo frequency
than others and which uses the SD_ASYM_PACKING feature to ensure that the
CPUs with the highest capacity are always used first in order to provide
maximum throughput.
Add arch_asym_cpu_priority() for arm64 as this function is used to
differentiate CPUs in the scheduler. The CPU's capacity is used to order
CPUs in the same DSU.
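
To make the ordering concrete, here is a minimal user-space sketch of the
idea: the priority of a CPU is simply its capacity, and an idle CPU with a
higher priority is preferred. All demo_* names and the capacity values are
hypothetical and for illustration only; this is not the scheduler's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative capacities only: CPUs 0-3 little, CPUs 4-7 big. */
static const unsigned long demo_cpu_capacity[] = {
	462, 462, 462, 462, 1024, 1024, 1024, 1024,
};
#define DEMO_NR_CPUS \
	(sizeof(demo_cpu_capacity) / sizeof(demo_cpu_capacity[0]))

/* Stand-in for arch_asym_cpu_priority(): priority == CPU capacity. */
static int demo_asym_cpu_priority(int cpu)
{
	return (int)demo_cpu_capacity[cpu];
}

/* True if CPU a should be preferred over CPU b. */
static bool demo_asym_prefer(int a, int b)
{
	return demo_asym_cpu_priority(a) > demo_asym_cpu_priority(b);
}

/*
 * Return the idle CPU with the highest priority, or -1 if none is idle.
 * On a tie the lowest-numbered CPU is kept.
 */
static int demo_pick_preferred_idle_cpu(const bool *idle)
{
	int best = -1;

	for (size_t cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (!idle[cpu])
			continue;
		if (best < 0 || demo_asym_prefer((int)cpu, best))
			best = (int)cpu;
	}
	return best;
}
```

With every CPU idle the sketch picks a big CPU first; a little CPU is only
picked once no big CPU is idle.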
Create a sched domain topology level for arm64 so we can set SD_ASYM_PACKING
at the MC level.
Some tests have been done on a hikey960 platform (quad Cortex-A53,
quad Cortex-A73). For test purposes, the CPU topology of the hikey960
has been modified so that the 8 heterogeneous cores are described as being
part of the same cluster and sharing resources (MC level), as with a
DynamiQ DSU.
Results below show the time in seconds to run sysbench --test=cpu with an
increasing number of threads. The sysbench test was run 32 times.
             without patch     with patch      diff
1 threads    11.04(+/- 30%)    8.86(+/- 0%)    -19%
2 threads    5.59(+/- 14%)     4.43(+/- 0%)    -20%
3 threads    3.80(+/- 13%)     2.95(+/- 0%)    -22%
4 threads    3.10(+/- 12%)     2.22(+/- 0%)    -28%
5 threads    2.47(+/- 5%)      1.95(+/- 0%)    -21%
6 threads    2.09(+/- 0%)      1.73(+/- 0%)    -17%
7 threads    1.64(+/- 0%)      1.56(+/- 0%)    - 7%
8 threads    1.42(+/- 0%)      1.42(+/- 0%)    0%
Results are better and more stable across iterations with the patch
than with mainline because the big cores are always used first, whereas
with mainline the scheduler randomly chooses a big or a little core when
there are more cores than threads.
With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
mainline whereas it stays in the range [8.85 .. 8.87] with the patch.
Signed-off-by: Vincent Guittot <[email protected]>
---
The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
to enable this dynamically at boot time by detecting the system topology.
arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 2186853..cb6705e5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
 	}
 }
 
+#ifdef CONFIG_SCHED_MC
+unsigned int __read_mostly arm64_sched_asym_enabled;
+
+int arch_asym_cpu_priority(int cpu)
+{
+	return topology_get_cpu_scale(NULL, cpu);
+}
+
+static inline int arm64_sched_dynamiq(void)
+{
+	return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
+}
+
+static int arm64_core_flags(void)
+{
+	return cpu_core_flags() | arm64_sched_dynamiq();
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init init_cpu_topology(void)
 {
 	reset_cpu_topology();
@@ -306,4 +333,7 @@ void __init init_cpu_topology(void)
 	 */
 	if (of_have_populated_dt() && parse_dt_topology())
 		reset_cpu_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology(arm64_topology);
 }
--
2.7.4
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
>
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
>
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
>
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
>
> without patch with patch diff
> 1 threads 11.04(+/- 30%) 8.86(+/- 0%) -19%
> 2 threads 5.59(+/- 14%) 4.43(+/- 0%) -20%
> 3 threads 3.80(+/- 13%) 2.95(+/- 0%) -22%
> 4 threads 3.10(+/- 12%) 2.22(+/- 0%) -28%
> 5 threads 2.47(+/- 5%) 1.95(+/- 0%) -21%
> 6 threads 2.09(+/- 0%) 1.73(+/- 0%) -17%
> 7 threads 1.64(+/- 0%) 1.56(+/- 0%) - 7%
> 8 threads 1.42(+/- 0%) 1.42(+/- 0%) 0%
>
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch
>
> Signed-off-by: Vincent Guittot <[email protected]>
>
> ---
>
> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
> to enable this dynamically at boot time by detecting the system topology.
>
> arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 2186853..cb6705e5 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
> }
> }
>
> +#ifdef CONFIG_SCHED_MC
> +unsigned int __read_mostly arm64_sched_asym_enabled;
> +
> +int arch_asym_cpu_priority(int cpu)
> +{
> + return topology_get_cpu_scale(NULL, cpu);
> +}
> +
> +static inline int arm64_sched_dynamiq(void)
> +{
> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
> +}
> +
> +static int arm64_core_flags(void)
> +{
> + return cpu_core_flags() | arm64_sched_dynamiq();
> +}
> +#endif
> +
> +static struct sched_domain_topology_level arm64_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
Maybe stick this in a macro to avoid the double #ifdef?
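
One possible reading of this suggestion, sketched as a user-space mock so it
compiles on its own: the MC entry, trailing comma included, is hidden behind
a macro defined in the existing #ifdef block, so the array initializer needs
no #ifdef of its own. The struct, the demo_* helpers and DEMO_MC_LEVEL are
all hypothetical stand-ins, not the kernel definitions, and the next version
of the patch may well look different.

```c
#include <stddef.h>

/* Mock of the kernel type; only the fields used here are modelled. */
struct demo_topology_level {
	int (*mask)(int cpu);
	int (*sd_flags)(void);
	const char *name;
};

/* Placeholder; the real mask helpers return a cpumask pointer. */
static int demo_cpu_mask(int cpu) { return cpu; }

#define CONFIG_SCHED_MC 1	/* remove this line to drop the MC level */

#ifdef CONFIG_SCHED_MC
static int demo_coregroup_mask(int cpu) { return cpu; }	/* placeholder */
static int demo_core_flags(void) { return 0; }		/* placeholder */

/* The whole MC entry lives behind the macro. */
#define DEMO_MC_LEVEL \
	{ demo_coregroup_mask, demo_core_flags, "MC" },
#else
#define DEMO_MC_LEVEL
#endif

static struct demo_topology_level demo_topology[] = {
	DEMO_MC_LEVEL
	{ demo_cpu_mask, NULL, "DIE" },
	{ NULL, NULL, NULL },
};

/* Count the levels that precede the NULL terminator. */
static size_t demo_nr_levels(void)
{
	size_t n = 0;

	while (demo_topology[n].mask)
		n++;
	return n;
}
```

With CONFIG_SCHED_MC defined the table holds the MC and DIE levels; without
it the macro expands to nothing and only DIE remains.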
Will
On 28 March 2018 at 11:12, Will Deacon <[email protected]> wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
>>
>> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
>> to enable this dynamically at boot time by detecting the system topology.
>>
>> arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
>> 1 file changed, 30 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 2186853..cb6705e5 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>> }
>> }
>>
>> +#ifdef CONFIG_SCHED_MC
>> +unsigned int __read_mostly arm64_sched_asym_enabled;
>> +
>> +int arch_asym_cpu_priority(int cpu)
>> +{
>> + return topology_get_cpu_scale(NULL, cpu);
>> +}
>> +
>> +static inline int arm64_sched_dynamiq(void)
>> +{
>> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
>> +}
>> +
>> +static int arm64_core_flags(void)
>> +{
>> + return cpu_core_flags() | arm64_sched_dynamiq();
>> +}
>> +#endif
>> +
>> +static struct sched_domain_topology_level arm64_topology[] = {
>> +#ifdef CONFIG_SCHED_MC
>> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
>
> Maybe stick this in a macro to avoid the double #ifdef?
OK, I will do that in the next version.
Vincent
>
> Will
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
>
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
>
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
>
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
>
> without patch with patch diff
> 1 threads 11.04(+/- 30%) 8.86(+/- 0%) -19%
> 2 threads 5.59(+/- 14%) 4.43(+/- 0%) -20%
> 3 threads 3.80(+/- 13%) 2.95(+/- 0%) -22%
> 4 threads 3.10(+/- 12%) 2.22(+/- 0%) -28%
> 5 threads 2.47(+/- 5%) 1.95(+/- 0%) -21%
> 6 threads 2.09(+/- 0%) 1.73(+/- 0%) -17%
> 7 threads 1.64(+/- 0%) 1.56(+/- 0%) - 7%
> 8 threads 1.42(+/- 0%) 1.42(+/- 0%) 0%
>
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch
Using ASYM_PACKING is essentially an easier but somewhat less accurate
way to achieve the same behaviour for big.LITTLE systems as with the
"misfit task" series that has been under review here for the last couple of
months.
As I see it, the main difference is that ASYM_PACKING attempts to pack
all tasks, regardless of task utilization, on the higher capacity cpus,
whereas the "misfit task" series carefully picks cpus with tasks they
can't handle, so we don't risk migrating tasks which are perfectly
suitable for a little cpu to a big cpu unnecessarily. Also, it is
based directly on utilization and cpu capacity, like the capacity
awareness we already have to deal with big.LITTLE in the wake-up path.
Furthermore, it should work for all big.LITTLE systems regardless of the
topology, whereas I think ASYM_PACKING might not work well for systems
with separate big and little sched_domains.
Have you tried taking the misfit patches for a spin on your setup? I
expect them to give you the same behaviour as you report above.
Morten
Hi Morten,
On 29 March 2018 at 14:53, Morten Rasmussen <[email protected]> wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
>> Arm DynamiQ system can integrate cores with different micro architecture
>> or max OPP under the same DSU so we can have cores with different compute
>> capacity at the LLC (which was not the case with legacy big/LITTLE
>> architecture). Such configuration is similar in some way to ITMT on intel
>> platform which allows some cores to be boosted to higher turbo frequency
>> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
>> highest capacity, will always be used in priortiy in order to provide
>> maximum throughput.
>>
>> Add arch_asym_cpu_priority() for arm64 as this function is used to
>> differentiate CPUs in the scheduler. The CPU's capacity is used to order
>> CPUs in the same DSU.
>>
>> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
>> at MC level.
>>
>> Some tests have been done on a hikey960 platform (quad cortex-A53,
>> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
>> has been modified so the 8 heterogeneous cores are described as being part
>> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>>
>> Results below show the time in seconds to run sysbench --test=cpu with an
>> increasing number of threads. The sysbench test run 32 times
>>
>> without patch with patch diff
>> 1 threads 11.04(+/- 30%) 8.86(+/- 0%) -19%
>> 2 threads 5.59(+/- 14%) 4.43(+/- 0%) -20%
>> 3 threads 3.80(+/- 13%) 2.95(+/- 0%) -22%
>> 4 threads 3.10(+/- 12%) 2.22(+/- 0%) -28%
>> 5 threads 2.47(+/- 5%) 1.95(+/- 0%) -21%
>> 6 threads 2.09(+/- 0%) 1.73(+/- 0%) -17%
>> 7 threads 1.64(+/- 0%) 1.56(+/- 0%) - 7%
>> 8 threads 1.42(+/- 0%) 1.42(+/- 0%) 0%
>>
>> Results show a better and stable results across iteration with the patch
>> compared to mainline because we are always using big cores in priority whereas
>> with mainline, the scheduler randomly choose a big or a little cores when
>> there are more cores than number of threads.
>> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
>> mainline whereas it stays in the range [8.85..8.87] with the patch
>
> Using ASYM_PACKING is essentially an easier but somewhat less accurate
> way to achieve the same behaviour for big.LITTLE system as with the
> "misfit task" series that been under review here for the last couple of
> months.
I think that it's not exactly the same goal, although it's probably
close, but ASYM_PACKING ensures that the maximum compute capacity is
used.
>
> As I see it, the main differences is that ASYM_PACKING attempts to pack
> all tasks regardless of task utilization on the higher capacity cpus
> whereas the "misfit task" series carefully picks cpus with tasks they
> can't handle so we don't risk migrating tasks which are perfectly
That's one main difference, because misfit task will leave mid-range
load tasks on little CPUs, which will not provide maximum performance.
I have put an example below.
> suitable to for a little cpu to a big cpu unnecessarily. Also it is
> based directly on utilization and cpu capacity like the capacity
> awareness we already have to deal with big.LITTLE in the wake-up path.
> Furthermore, it should work for all big.LITTLE systems regardless of the
> topology, where I think ASYM_PACKING might not work well for systems
> with separate big and little sched_domains.
I haven't looked in detail at whether ASYM_PACKING can work correctly on
legacy big.LITTLE, as I was mainly focused on the DynamiQ config, but I
guess that it might also work.
>
> Have to tried taking the misfit patches for a spin on your setup? I
> expect them give you the same behaviour as you report above.
So I have tried both your tests and mine on both patchsets and they
provide the same results, which is somewhat expected as the benchmarks run
for several seconds.
In order to highlight the main difference between misfit task and
ASYM_PACKING, I have reused your test and reduced the number of
max-requests for sysbench so that the test duration was in the range of
hundreds of ms.
Hikey960 (emulated DynamiQ topology)
          min         avg(stdev)          max
misfit    0.097500    0.114911(+- 10%)    0.138500
asym      0.092500    0.106072(+- 6%)     0.122900
In this case, we can see that ASYM_PACKING is doing better (8%)
because it migrates sysbench threads to big cores as soon as they are
available, whereas misfit task has to wait for the utilization to
increase above 80%, which takes around 70ms when starting from
zero utilization.
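
The 70ms figure can be sanity-checked with a small model. This is a sketch
under stated assumptions, not the kernel's load-tracking code: it assumes a
geometric utilization signal whose past contribution halves every 32 periods
of about 1ms (y^32 = 0.5), a task that runs continuously, and no frequency
or capacity scaling. The demo_* names are hypothetical.

```c
/*
 * Per-period decay factor, assuming a contribution is halved after
 * 32 periods of ~1ms each: y^32 = 0.5.
 */
#define DEMO_PELT_Y	0.978572062
#define DEMO_MAX_MS	1000

/*
 * Number of 1ms periods a continuously running task needs, starting
 * from zero utilization, before its utilization (as a fraction of the
 * maximum) exceeds 'threshold'. Returns -1 if that does not happen
 * within DEMO_MAX_MS periods.
 */
static int demo_ms_to_exceed(double threshold)
{
	double util = 0.0;

	for (int ms = 1; ms <= DEMO_MAX_MS; ms++) {
		/* Old signal decays; the fully busy period adds (1 - y). */
		util = util * DEMO_PELT_Y + (1.0 - DEMO_PELT_Y);
		if (util > threshold)
			return ms;
	}
	return -1;
}
```

Under these assumptions demo_ms_to_exceed(0.8) returns 75, which agrees with
the closed form 32 * log2(1 / (1 - 0.8)), about 74.3ms, and with the
"around 70ms" quoted above.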
Regards,
Vincent
>
> Morten
Hi,
On 30/03/18 13:34, Vincent Guittot wrote:
> Hi Morten,
>
[..]
>>
>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>> all tasks regardless of task utilization on the higher capacity cpus
>> whereas the "misfit task" series carefully picks cpus with tasks they
>> can't handle so we don't risk migrating tasks which are perfectly
>
> That's one main difference because misfit task will let middle range
> load task on little CPUs which will not provide maximum performance.
> I have put an example below
>
>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>> based directly on utilization and cpu capacity like the capacity
>> awareness we already have to deal with big.LITTLE in the wake-up path.
I think that bit is quite important. AFAICT, ASYM_PACKING disregards
task utilization; it only makes sure that (with your patch) tasks will be
migrated to big CPUs if those ever go idle (pulls at NEWLY_IDLE balance or
later on during nohz balance). I didn't see anything related to ASYM_PACKING
in the wake path.
>> Have to tried taking the misfit patches for a spin on your setup? I
>> expect them give you the same behaviour as you report above.
>
> So I have tried both your tests and mine on both patchset and they
> provide same results which is somewhat expected as the benches are run
> for several seconds.
> In other to highlight the main difference between misfit task and
> ASYM_PACKING, I have reused your test and reduced the number of
> max-request for sysbench so that the test duration was in the range of
> hundreds ms.
>
> Hikey960 (emulate dynamiq topology)
> min avg(stdev) max
> misfit 0.097500 0.114911(+- 10%) 0.138500
> asym 0.092500 0.106072(+- 6%) 0.122900
>
> In this case, we can see that ASYM_PACKING is doing better( 8%)
> because it migrates sysbench threads on big core as soon as they are
> available whereas misfit task has to wait for the utilization to
> increase above the 80% which takes around 70ms when starting with an
> utilization that is null
>
I believe ASYM_PACKING behaves better here because the workload is only
sysbench threads. As stated above, since task utilization is disregarded, I
think we could have a scenario where the big CPUs are filled with "small"
tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
matters here is the order in which the tasks spawn, not their utilization -
which is potentially broken.
There's that bit in *update_sd_pick_busiest()*:
	/* No ASYM_PACKING if target CPU is already busy */
	if (env->idle == CPU_NOT_IDLE)
		return true;
So I'm not entirely sure how realistic that scenario is, but I suppose it
could still happen. Food for thought in any case.
Regards,
Valentin
Hi Valentin,
On 3 April 2018 at 00:27, Valentin Schneider <[email protected]> wrote:
> Hi,
>
> On 30/03/18 13:34, Vincent Guittot wrote:
>> Hi Morten,
>>
> [..]
>>>
>>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>>> all tasks regardless of task utilization on the higher capacity cpus
>>> whereas the "misfit task" series carefully picks cpus with tasks they
>>> can't handle so we don't risk migrating tasks which are perfectly
>>
>> That's one main difference because misfit task will let middle range
>> load task on little CPUs which will not provide maximum performance.
>> I have put an example below
>>
>>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>>> based directly on utilization and cpu capacity like the capacity
>>> awareness we already have to deal with big.LITTLE in the wake-up path.
>
> I think that bit is quite important. AFAICT, ASYM_PACKING disregards
> task utilization, it only makes sure that (with your patch) tasks will be
> migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or
> later on during nohz balance). I didn't see anything related to ASYM_PACKING
> in the wake path.
>
>>> Have to tried taking the misfit patches for a spin on your setup? I
>>> expect them give you the same behaviour as you report above.
>>
>> So I have tried both your tests and mine on both patchset and they
>> provide same results which is somewhat expected as the benches are run
>> for several seconds.
>> In other to highlight the main difference between misfit task and
>> ASYM_PACKING, I have reused your test and reduced the number of
>> max-request for sysbench so that the test duration was in the range of
>> hundreds ms.
>>
>> Hikey960 (emulate dynamiq topology)
>> min avg(stdev) max
>> misfit 0.097500 0.114911(+- 10%) 0.138500
>> asym 0.092500 0.106072(+- 6%) 0.122900
>>
>> In this case, we can see that ASYM_PACKING is doing better( 8%)
>> because it migrates sysbench threads on big core as soon as they are
>> available whereas misfit task has to wait for the utilization to
>> increase above the 80% which takes around 70ms when starting with an
>> utilization that is null
>>
>
> I believe ASYM_PACKING behaves better here because the workload is only
> sysbench threads. As stated above, since task utilization is disregarded, I
It behaves better because it doesn't wait for the task's utilization
to reach a given level before assuming the task needs high compute capacity.
The utilization gives an idea of the running time of the task, not the
performance level that is needed.
> think we could have a scenario where the big CPUs are filled with "small"
> tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
> matters here is the order in which the tasks spawn, not their utilization -
> which is potentially broken.
>
> There's that bit in *update_sd_pick_busiest()*:
>
> /* No ASYM_PACKING if target CPU is already busy */
> if (env->idle == CPU_NOT_IDLE)
> return true;
>
> So I'm not entirely sure how realistic that scenario is, but I suppose it
> could still happen. Food for thought in any case.
>
> Regards,
> Valentin
Hi,
On 03/04/18 13:17, Vincent Guittot wrote:
> Hi Valentin,
>
[...]
>>
>> I believe ASYM_PACKING behaves better here because the workload is only
>> sysbench threads. As stated above, since task utilization is disregarded, I
>
> It behaves better because it doesn't wait for the task's utilization
> to reach a level before assuming the task needs high compute capacity.
> The utilization gives an idea of the running time of the task not the
> performance level that is needed
>
That's my point actually. ASYM_PACKING disregards utilization and moves those
threads to the big cores ASAP, which is good here because it's just sysbench
threads.
What I meant was that if the task composition changes, IOW we mix "small"
tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
sysbench threads), we shouldn't assume all of those need to run on a big
CPU. The thing is, ASYM_PACKING can't tell the difference between those, so
it'll all come down to which task spawned first.
Furthermore, ASYM_PACKING will forcefully move tasks via active balance
regardless of the imbalance as long as a big CPU is idle.
So we could have a scenario where loads of "small" tasks spawn, and they all
get moved to a big CPU until they're all full (because they're periodic tasks
so the big CPUs will eventually be idle and will pull another task as long as
they get some idle time).
Then, before the load tracking signals of those tasks ramp up high enough
that the load balancer would try to move those to LITTLE CPUs, some "big"
tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
balanced so nothing will be done.
I acknowledge this all sounds convoluted but I hope it highlights what I
think could go wrong with ASYM_PACKING on asymmetric systems.
Regards,
Valentin
On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
> Hi,
>
> On 03/04/18 13:17, Vincent Guittot wrote:
>> Hi Valentin,
>>
> [...]
>>>
>>> I believe ASYM_PACKING behaves better here because the workload is only
>>> sysbench threads. As stated above, since task utilization is disregarded, I
>>
>> It behaves better because it doesn't wait for the task's utilization
>> to reach a level before assuming the task needs high compute capacity.
>> The utilization gives an idea of the running time of the task not the
>> performance level that is needed
>>
>
> That's my point actually. ASYM_PACKING disregards utilization and moves those
> threads to the big cores ASAP, which is good here because it's just sysbench
> threads.
>
> What I meant was that if the task composition changes, IOW we mix "small"
> tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> sysbench threads), we shouldn't assume all of those require to run on a big
> CPU. The thing is, ASYM_PACKING can't make the difference between those, so
That's the first point where I tend to disagree: why are big cores only
for long-running tasks, and why can't periodic stuff need to run on big
cores to get max compute capacity?
You make the assumption that only long-running tasks need high compute
capacity. This patch wants to always provide max compute capacity to
the system, and not only to long-running tasks.
> it'll all come down to which task spawned first.
>
> Furthermore, ASYM_PACKING will forcefully move tasks via active balance
> regardless of the imbalance as long as a big CPU is idle.
>
> So we could have a scenario where loads of "small" tasks spawn, and they all
> get moved to a big CPU until they're all full (because they're periodic tasks
> so the big CPUs will eventually be idle and will pull another task as long as
> they get some idle time).
>
> Then, before the load tracking signals of those tasks ramp up high enough
> that the load balancer would try to move those to LITTLE CPUs, some "big"
> tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
> balanced so nothing will be done.
As explained above, as long as the big CPUs are always used, I don't
think it's a problem. What is a problem is if a task stays on a little
CPU while a big CPU is idle, because we could provide more throughput.
>
>
> I acknowledge this all sounds convoluted but I hope it highlights what I
> think could go wrong with ASYM_PACKING on asymmetric systems.
>
> Regards,
> Valentin
On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
> > Hi,
> >
> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> Hi Valentin,
> >>
> > [...]
> >>>
> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >>> sysbench threads. As stated above, since task utilization is disregarded, I
> >>
> >> It behaves better because it doesn't wait for the task's utilization
> >> to reach a level before assuming the task needs high compute capacity.
> >> The utilization gives an idea of the running time of the task not the
> >> performance level that is needed
> >>
> >
> > That's my point actually. ASYM_PACKING disregards utilization and moves those
> > threads to the big cores ASAP, which is good here because it's just sysbench
> > threads.
> >
> > What I meant was that if the task composition changes, IOW we mix "small"
> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> > sysbench threads), we shouldn't assume all of those require to run on a big
> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>
> That's the 1st point where I tend to disagree: why big cores are only
> for long running task and periodic stuff can't need to run on big
> cores to get max compute capacity ?
> You make the assumption that only long running tasks need high compute
> capacity. This patch wants to always provide max compute capacity to
> the system and not only long running task
There is no way we can tell whether a periodic or short-running task
requires the compute capacity of a big core based on utilization
alone. The utilization can only tell us if a task could potentially use
more compute capacity, i.e. the utilization approaches the compute
capacity of its current cpu.
How we handle low utilization tasks comes down to how we define
"performance" and if we care about the cost of "performance" (e.g.
energy consumption).
Placing a low utilization task on a little cpu should always be fine
from a _throughput_ point of view. As long as the cpu has spare cycles it
means that work isn't piling up faster than it can be processed.
However, from a _latency_ (completion time) point of view it might be a
problem, and for latency sensitive tasks I can agree that going for max
capacity might be a better choice.
The misfit patches place tasks based on utilization to ensure that
tasks get the _throughput_ they need if possible. This is in line with
the placement policy we have in select_task_rq_fair() already.
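
The two placement criteria being compared in this thread can be put side by
side in a small sketch. It is only a model: the demo_* names are
hypothetical, and the 1280/1024 margin is an assumption chosen to match the
80% threshold mentioned earlier in the thread.

```c
#include <stdbool.h>

#define DEMO_CAPACITY_SCALE	1024UL
/* Assumed ~20% headroom: a task fits while util is below ~80% of capacity. */
#define DEMO_CAPACITY_MARGIN	1280UL

/* Utilization-based fit check against a CPU of the given capacity. */
static bool demo_task_fits(unsigned long util, unsigned long capacity)
{
	return capacity * DEMO_CAPACITY_SCALE > util * DEMO_CAPACITY_MARGIN;
}

/* Misfit-like policy: move up only once the task no longer fits. */
static bool demo_misfit_wants_big(unsigned long util, unsigned long little_cap)
{
	return !demo_task_fits(util, little_cap);
}

/* ASYM_PACKING-like policy: move up whenever a big CPU is idle. */
static bool demo_asym_wants_big(unsigned long util, bool big_cpu_idle)
{
	(void)util;	/* utilization is deliberately ignored */
	return big_cpu_idle;
}
```

For an illustrative little-CPU capacity of 462, a task with utilization 100
stays put under the misfit-like policy and one with utilization 400 is
moved, while the ASYM_PACKING-like policy moves either of them as soon as a
big CPU is idle.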
We shouldn't forget that what we are discussing here is the default
behaviour when we don't have sufficient knowledge about the tasks in the
scheduler. So we are looking for a reasonable middle-of-the-road policy that
doesn't kill your performance or the battery. If user-space has its own
opinion about performance requirements it is free to use task affinity
to control which cpu the task ends up on and ensure that the task always
gets max capacity. On top of that we have had interfaces in Android
for years to specify performance requirements for tasks (groups) to allow
small tasks to be placed on big cpus and big tasks to be placed on little
cpus depending on their requirements. It is even tied into cpufreq as
well. A lot of effort has gone into Android to get this balance right.
Patrick is working hard on upstreaming some of those features.
In the bigger picture, always going for max capacity is not desirable for
a well-configured big.LITTLE system. You would never exploit the advantage
of the little cpus, as you always use big first and only use little when
the bigs are overloaded, at which point having little cpus at all makes
little sense. Vendors build big.LITTLE systems because they want a
better performance/energy trade-off; if they wanted max capacity always,
they would just build big-only systems.
If we were that concerned about latency, DVFS would be a problem too
and we would use nothing but the performance governor. So seen in the
bigger picture I have to disagree that blindly going for max capacity is
the right default policy for big.LITTLE. As soon as we involve an energy
model in the task placement decisions, it definitely isn't.
Morten
Hi Morten,
On 5 April 2018 at 17:46, Morten Rasmussen <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
>> > Hi,
>> >
>> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> Hi Valentin,
>> >>
>> > [...]
>> >>>
>> >>> I believe ASYM_PACKING behaves better here because the workload is only
>> >>> sysbench threads. As stated above, since task utilization is disregarded, I
>> >>
>> >> It behaves better because it doesn't wait for the task's utilization
>> >> to reach a level before assuming the task needs high compute capacity.
>> >> The utilization gives an idea of the running time of the task not the
>> >> performance level that is needed
>> >>
>> >
>> > That's my point actually. ASYM_PACKING disregards utilization and moves those
>> > threads to the big cores ASAP, which is good here because it's just sysbench
>> > threads.
>> >
>> > What I meant was that if the task composition changes, IOW we mix "small"
>> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
>> > sysbench threads), we shouldn't assume all of those require to run on a big
>> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>>
>> That's the 1st point where I tend to disagree: why big cores are only
>> for long running task and periodic stuff can't need to run on big
>> cores to get max compute capacity ?
>> You make the assumption that only long running tasks need high compute
>> capacity. This patch wants to always provide max compute capacity to
>> the system and not only long running task
>
> There is no way we can tell if a periodic or short-running tasks
> requires the compute capacity of a big core or not based on utilization
> alone. The utilization can only tell us if a task could potentially use
> more compute capacity, i.e. the utilization approaches the compute
> capacity of its current cpu.
>
> How we handle low utilization tasks comes down to how we define
> "performance" and if we care about the cost of "performance" (e.g.
> energy consumption).
>
> Placing a low utilization task on a little cpu should always be fine
> from _throughput_ point of view. As long as the cpu has spare cycles it
I disagree: throughput is not only a matter of spare cycles, it is also a
matter of how fast you compute the work, for example when IO activity
is involved.
> means that work isn't piling up faster than it can be processed.
> However, from a _latency_ (completion time) point of view it might be a
> problem, and for latency sensitive tasks I can agree that going for max
> capacity might be better choice.
>
> The misfit patches places tasks based on utilization to ensure that
> tasks get the _throughput_ they need if possible. This is in line with
> the placement policy we have in select_task_rq_fair() already.
>
> We shouldn't forget that what we are discussing here is the default
> behaviour when we don't have sufficient knowledge about the tasks in the
> scheduler. So we are looking a reasonable middle-of-the-road policy that
> doesn't kill your performance or the battery. If user-space has its own
But misfit task kills performance and might also kill your battery as
it doesn't prevent small tasks from running on big cores.
The default behavior of the scheduler is to provide max _throughput_,
not middle performance; side activities can then mitigate the power
impact, like frequency scaling or like EAS, which tries to optimize the
usage of energy when the system is not overloaded. With misfit task, you
make the assumption that a short task on a little core is the best
placement even from a performance PoV. It seems that you make
some power/performance assumptions without using an energy model, which
is what can make such a decision. That is the whole point of EAS.
> opinion about performance requirements it is free to use task affinity
> to control which cpu the task end up on and ensure that the task gets
> max capacity always. On top of that we have had interfaces in Android
> for years to specify performance requirements for task (groups) to allow
> small tasks to be placed on big cpus and big task to be placed on little
> cpus depending on their requirements. It is even tied into cpufreq as
> well. A lot of effort has gone into Android to get this balance right.
> Patrick is working hard on upstreaming some of those features.
>
> In the bigger picture always going for max capacity is not desirable for
> well-configured big.LITTLE system. You would never exploit the advantage
> of the little cpus as you always use big first and only use little when
> the bigs are overloaded at which point having little cpus at all makes
If I'm not wrong, the misfit task patchset doesn't prevent little tasks
from running on big cores.
> little sense. Vendors build big.LITTLE systems because they want a
> better performance/energy trade-off, if they wanted max capacity always,
> they would just built big-only systems.
And that's the whole purpose of the EAS patchset. The EAS patchset is there
to put some energy awareness into the scheduler's decisions. There are 2
running modes for EAS: one when there are spare cycles, so tasks can be
placed to optimize energy consumption, and one when the system or part
of the system is overloaded and it goes back to the default performance
mode, because there is no interest in energy efficiency and we just
want to provide max performance. So asym packing fits with this
latter mode, as it provides the max compute capacity to the default mode
and doesn't break EAS, as it uses the load balance, which is disabled by
EAS in non-overloaded mode.
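The two running modes described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: the enum and
function names are hypothetical (not kernel symbols), and the 80%
overload threshold is assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placement modes; the names are illustrative only. */
enum placement_mode {
	PLACE_ENERGY_AWARE,	/* spare cycles: place tasks to save energy */
	PLACE_PERFORMANCE,	/* overloaded: fall back to load balancing */
};

/*
 * Assumption: a CPU counts as overloaded once its utilization exceeds
 * 80% of its capacity (util * 5 > capacity * 4).
 */
static bool cpu_is_overloaded(unsigned long util, unsigned long capacity)
{
	return util * 5 > capacity * 4;
}

/* Performance mode as soon as one CPU is overloaded, energy-aware otherwise. */
static enum placement_mode pick_mode(const unsigned long *util,
				     const unsigned long *capacity,
				     int nr_cpus)
{
	assert(nr_cpus > 0);

	for (int i = 0; i < nr_cpus; i++)
		if (cpu_is_overloaded(util[i], capacity[i]))
			return PLACE_PERFORMANCE;

	return PLACE_ENERGY_AWARE;
}
```

With one little CPU (capacity 512) and one big CPU (capacity 1024), a
utilization of 500 on the little CPU is enough to switch the whole
system to the performance mode in this sketch.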
Vincent
>
> If we would be that concerned about latency, DVFS would be a problem too
> and we would use nothing but the performance governor. So seen in the
> bigger picture I have to disagree that blindly going for max capacity is
> the right default policy for big.LITTLE. As soon as we involve a energy
> model in the task placement decisions, it definitely isn't.
>
> Morten
On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> Hi Morten,
>
> On 5 April 2018 at 17:46, Morten Rasmussen <[email protected]> wrote:
> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> >> Hi Valentin,
> >> >>
> >> > [...]
> >> >>>
> >> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >> >>> sysbench threads. As stated above, since task utilization is disregarded, I
> >> >>
> >> >> It behaves better because it doesn't wait for the task's utilization
> >> >> to reach a level before assuming the task needs high compute capacity.
> >> >> The utilization gives an idea of the running time of the task not the
> >> >> performance level that is needed
> >> >>
> >> >
> >> > That's my point actually. ASYM_PACKING disregards utilization and moves those
> >> > threads to the big cores ASAP, which is good here because it's just sysbench
> >> > threads.
> >> >
> >> > What I meant was that if the task composition changes, IOW we mix "small"
> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> >> > sysbench threads), we shouldn't assume all of those require to run on a big
> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
> >>
> >> That's the 1st point where I tend to disagree: why big cores are only
> >> for long running task and periodic stuff can't need to run on big
> >> cores to get max compute capacity ?
> >> You make the assumption that only long running tasks need high compute
> >> capacity. This patch wants to always provide max compute capacity to
> >> the system and not only long running task
> >
> > There is no way we can tell if a periodic or short-running tasks
> > requires the compute capacity of a big core or not based on utilization
> > alone. The utilization can only tell us if a task could potentially use
> > more compute capacity, i.e. the utilization approaches the compute
> > capacity of its current cpu.
> >
> > How we handle low utilization tasks comes down to how we define
> > "performance" and if we care about the cost of "performance" (e.g.
> > energy consumption).
> >
> > Placing a low utilization task on a little cpu should always be fine
> > from _throughput_ point of view. As long as the cpu has spare cycles it
>
> I disagree, throughput is not only a matter of spare cycle it's also a
> matter of how fast you compute the work like with IO activity as an
> example
From a cpu centric point of view it is, but I agree that from an
application/user point of view completion time might impact throughput
too. For example, if your throughput depends on how fast you can
offload work to some peripheral device (a GPU for example).
However, as I said in the beginning, we don't know what the task does.
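To make the two viewpoints concrete, here is a small arithmetic sketch.
It assumes that execution time scales linearly with cpu capacity and
uses a hypothetical little-cpu capacity of 512; neither the helper names
nor the numbers come from the patches.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL	/* capacity of the biggest cpu */

/*
 * Time to finish a job on a cpu of the given capacity, when the same
 * job takes work_us microseconds on a cpu of capacity 1024.
 * Assumption: execution time scales linearly with capacity.
 */
static unsigned long completion_time_us(unsigned long work_us,
					unsigned long capacity)
{
	assert(capacity > 0 && capacity <= SCHED_CAPACITY_SCALE);
	return work_us * SCHED_CAPACITY_SCALE / capacity;
}

/*
 * Share of the cpu (in 1/1024 units) consumed by a job of work_us that
 * is repeated every period_us on a cpu of the given capacity.
 */
static unsigned long cpu_share(unsigned long work_us, unsigned long period_us,
			       unsigned long capacity)
{
	assert(period_us > 0);
	return completion_time_us(work_us, capacity) * SCHED_CAPACITY_SCALE /
	       period_us;
}
```

For a job of 2000 us repeated every 20000 us, the little cpu still has
plenty of spare cycles (cpu centric view), yet each activation takes
4000 us instead of 2000 us (completion time view).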
> > means that work isn't piling up faster than it can be processed.
> > However, from a _latency_ (completion time) point of view it might be a
> > problem, and for latency sensitive tasks I can agree that going for max
> > capacity might be better choice.
> >
> > The misfit patches places tasks based on utilization to ensure that
> > tasks get the _throughput_ they need if possible. This is in line with
> > the placement policy we have in select_task_rq_fair() already.
> >
> > We shouldn't forget that what we are discussing here is the default
> > behaviour when we don't have sufficient knowledge about the tasks in the
> > scheduler. So we are looking a reasonable middle-of-the-road policy that
> > doesn't kill your performance or the battery. If user-space has its own
>
> But misfit task kills performance and might also kills your battery as
> it doesn't prevent small task to run on big cores
As I said, it is not perfect for all use-cases; it is a middle-of-the-road
approach. But I strongly disagree that it is always a bad choice for
both energy and performance as you suggest. ASYM_PACKING doesn't
guarantee max "throughput" (by your definition) either, as you may fill
up your big cores with smaller tasks, leaving the big tasks behind on
little cpus.
> The default behavior of the scheduler is to provide max _throughput_
> not middle performance and then side activity can mitigate the power
> impact like frequency scaling or like EAS which tries to optimize the
> usage of energy when system is not overloaded.
That view doesn't fit very well with all activities around integrating
cpufreq and the scheduler. Frequency scaling is an important factor in
optimizing the throughput.
> With misfit task, you
> make the assumption that short task on little core is the best
> placement to do even for a performance PoV.
I never said it was the best placement, I said it was a reasonable
default policy for big.LITTLE systems.
> It seems that you make
> some power/performance assumption without using an energy model which
> can make such decision. This is all the interest of EAS.
I'm trying to see the bigger picture where you seem not to. The
ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric
view and the default policy I'm suggesting doesn't violate that view.
Your own code in group_is_overloaded() follows this view as it is
utilization based and happily accepts partially utilized groups as being
fine without needing to be offloaded, even though you could have multiple
tasks waiting to execute. CFS doesn't provide any latency guarantees, but
we of course do the best we can within reason to minimize latency.
Seen in the bigger picture I would consider going for max capacity for
big.LITTLE systems more aggressive than using the performance cpufreq
governor. Nobody does the latter for battery powered devices, hence I
don't see why anyone would go big-always for big.LITTLE systems.
>
> > opinion about performance requirements it is free to use task affinity
> > to control which cpu the task end up on and ensure that the task gets
> > max capacity always. On top of that we have had interfaces in Android
> > for years to specify performance requirements for task (groups) to allow
> > small tasks to be placed on big cpus and big task to be placed on little
> > cpus depending on their requirements. It is even tied into cpufreq as
> > well. A lot of effort has gone into Android to get this balance right.
> > Patrick is working hard on upstreaming some of those features.
> >
> > In the bigger picture always going for max capacity is not desirable for
> > well-configured big.LITTLE system. You would never exploit the advantage
> > of the little cpus as you always use big first and only use little when
> > the bigs are overloaded at which point having little cpus at all makes
>
> If i'm not wrong misfit task patchset doesn't prevent little task to
> run on big core
It does not, in fact it doesn't touch small tasks at all, that is not
the point of the patch set. The point is to make sure that big tasks
don't get stuck on little cpus. IOW, a selective little to big
migration based on task utilization.
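As a rough illustration of "selective little to big migration based on
task utilization", a misfit check could look like the sketch below. The
function names and the ~20% margin are assumptions made for this
example, not a copy of the patch set.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed margin: a task fits if it uses less than ~80% of the cpu. */
#define CAPACITY_MARGIN 1280UL	/* 1280 / 1024 = 1.25 */

/* True if the task's utilization fits the cpu capacity with some margin. */
static bool task_fits_capacity(unsigned long task_util,
			       unsigned long cpu_capacity)
{
	return cpu_capacity * 1024UL > task_util * CAPACITY_MARGIN;
}

/*
 * A task is treated as misfit when it no longer fits its current cpu
 * and a cpu with more capacity exists in the system. Small tasks are
 * never flagged, wherever they run.
 */
static bool task_is_misfit(unsigned long task_util,
			   unsigned long cpu_capacity,
			   unsigned long max_capacity)
{
	assert(cpu_capacity <= max_capacity);
	return cpu_capacity < max_capacity &&
	       !task_fits_capacity(task_util, cpu_capacity);
}
```

With a hypothetical little cpu of capacity 512, a task with utilization
100 is left alone, while a task with utilization 450 is flagged for
migration to a bigger cpu.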
>
> > little sense. Vendors build big.LITTLE systems because they want a
> > better performance/energy trade-off, if they wanted max capacity always,
> > they would just built big-only systems.
>
> And that's all the purpose of the EAS patchset. EAS patchset is there
> to put some energy awareness in the scheduler decision. There is 2
> running mode for EAS: one when there is spare cycles so tasks can be
> placed to optimize energy consumption. And one when the system or part
> of the system is overloaded and it goes back to default performance
> mode because there is no interest for energy efficiency and we just
> want to provide max performance. So the asym packing fits with this
> latter mode as it provide the max compute capacity to the default mode
> and doesn't break EAS as it uses the load balance which is disable by
> EAS in not overloaded mode
We still care about energy even when we are overutilized. We really
don't want a vastly different placement policy depending on whether we
are overutilized or not if we can avoid it as the situation changes
frequently in many real world scenarios. With ASYM_PACKING everything
could suddenly shift to big cpus if a little cpu is suddenly
overutilized. With the misfit patches, we would detect exactly which
little cpu needs help, migrate the misfit task and everything will
return to non-overutilized. That is why I said that ASYM_PACKING is
incompatible with energy-aware scheduling and we would need the misfit
patches anyway.
Morten
Hi Morten,
On 6 April 2018 at 14:58, Morten Rasmussen <[email protected]> wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen <[email protected]> wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
[snip]
>> >> > What I meant was that if the task composition changes, IOW we mix "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.
I agree, but that's not what you do with misfit, as you assume that long
running tasks have higher priority than shorter running tasks.
>
>> > means that work isn't piling up faster than it can be processed.
>> > However, from a _latency_ (completion time) point of view it might be a
>> > problem, and for latency sensitive tasks I can agree that going for max
>> > capacity might be better choice.
>> >
>> > The misfit patches places tasks based on utilization to ensure that
>> > tasks get the _throughput_ they need if possible. This is in line with
>> > the placement policy we have in select_task_rq_fair() already.
>> >
>> > We shouldn't forget that what we are discussing here is the default
>> > behaviour when we don't have sufficient knowledge about the tasks in the
>> > scheduler. So we are looking a reasonable middle-of-the-road policy that
>> > doesn't kill your performance or the battery. If user-space has its own
>>
>> But misfit task kills performance and might also kills your battery as
>> it doesn't prevent small task to run on big cores
>
> As I said it is not perfect for all use-cases, it is middle-of-the-road
> approach. But I strongly disagree that it is always a bad choice for
mmh ... I never said that it's always a bad choice; I said that it can
also easily make a bad choice and kill performance and / or battery. In
fact, we can't really predict the behavior of the system, as short
running tasks can be randomly put on big or little cores, and random
behavior is impossible to predict and mitigate.
> both energy and performance as you suggest. ASYM_PACKING doesn't
> guarantee max "throughput" (by your definition) either as you may fill
> up your big cores with smaller tasks leaving the big tasks behind on
> little cpus.
You didn't understand the point here. Asym ensures the max throughput
for the system because it will provide the max compute capacity per
second to the whole system and not only to some specific tasks. You
assume that long running tasks must run on big cores and short
running tasks must not. But why is filling a big core with long running
tasks and filling a little core with short running tasks the best choice?
Why should the opposite not be better, as long as the big core is fully
used? The goal is to keep the big CPUs used whatever the type of tasks.
Then, there are other mechanisms like cgroups to help sort groups of
tasks.
You try to partially do 2 things at the same time.
>
>> The default behavior of the scheduler is to provide max _throughput_
>> not middle performance and then side activity can mitigate the power
>> impact like frequency scaling or like EAS which tries to optimize the
>> usage of energy when system is not overloaded.
>
> That view doesn't fit very well with all activities around integrating
> cpufreq and the scheduler. Frequency scaling is an important factor in
> optimizing the throughput.
>
Here too you didn't catch my point. Please don't attribute intentions
to me that I don't have.
By side activity, I'm not saying that it should not consolidate the
cpufreq and other framework decisions. The scheduler is the best place to
consolidate CPU related decisions. I'm just saying that it's an
additional action taken to optimize energy.
The scheduler doesn't use the current frequency in task placement and load
balancing, as it assumes that max throughput is available if needed and
adjusts the frequency to current needs.
>
>> With misfit task, you
>> make the assumption that short task on little core is the best
>> placement to do even for a performance PoV.
>
> I never said it was the best placement, I said it was a reasonable
> default policy for big.LITTLE systems.
But "The primary job for the task scheduler is to deliver the highest
possible throughput with minimal latency."
>
>> It seems that you make
>> some power/performance assumption without using an energy model which
>> can make such decision. This is all the interest of EAS.
>
> I'm trying to see the bigger picture where you seem not to. The
Thanks for helping me to get the bigger picture ;-)
> ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric
> view and the default policy I'm suggesting doesn't violate that view.
Sorry, I don't follow the sentences above.
> Your own code in group_is_overloaded() follows this view as it is
> utilization based and happily accepts partially utilized groups as being
But this is done for SMP systems where all cores have the same capacity,
and to detect when tasks can get more throughput on another CPU.
ASYM_PACKING is there to add capacity awareness to the load balance
when CPUs have different capacities.
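A minimal sketch of what capacity awareness means here, following the
patch description where the CPU's capacity is used to order CPUs. The
capacity table and all function names below are hypothetical and only
serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical capacities: 4 little CPUs followed by 4 big CPUs. */
static const unsigned long cpu_capacity[NR_CPUS] = {
	512, 512, 512, 512, 1024, 1024, 1024, 1024
};

/* Sketch of the arch hook: the priority is simply the CPU capacity. */
static int asym_cpu_priority(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (int)cpu_capacity[cpu];
}

/* True if CPU a should be used in preference to CPU b. */
static bool asym_prefer(int a, int b)
{
	return asym_cpu_priority(a) > asym_cpu_priority(b);
}

/* Return the idle CPU with the highest priority, or -1 if none is idle. */
static int pick_preferred_idle_cpu(const bool *idle)
{
	int best = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu] && (best < 0 || asym_prefer(cpu, best)))
			best = cpu;

	return best;
}
```

Work lands on a big CPU whenever one is idle, and little CPUs are only
used once the big ones are busy, whatever the utilization of the task.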
> fine without need to be offloaded despite you could have multiple tasks
> waiting to execute.
> CFS doesn't not provide any latency guarantees, but
> we of course do the best we can within reason to minimize it.
>
> Seen in the bigger picture I would consider going for max capacity for
> big.LITTLE systems more aggressive than using the performance cpufreq
> govenor. Nobody does the latter for battery powered devices, hence I
> don't see why anyone would to go big-always for big.LITTLE systems.
And that's why EAS exists: to make battery friendly decisions.
>
>>
>> > opinion about performance requirements it is free to use task affinity
>> > to control which cpu the task end up on and ensure that the task gets
>> > max capacity always. On top of that we have had interfaces in Android
>> > for years to specify performance requirements for task (groups) to allow
>> > small tasks to be placed on big cpus and big task to be placed on little
>> > cpus depending on their requirements. It is even tied into cpufreq as
>> > well. A lot of effort has gone into Android to get this balance right.
>> > Patrick is working hard on upstreaming some of those features.
>> >
>> > In the bigger picture always going for max capacity is not desirable for
>> > well-configured big.LITTLE system. You would never exploit the advantage
>> > of the little cpus as you always use big first and only use little when
>> > the bigs are overloaded at which point having little cpus at all makes
>>
>> If i'm not wrong misfit task patchset doesn't prevent little task to
>> run on big core
>
> It does not, in fact it doesn't touch small tasks at all, that is not
> the point of the patch set. The point is to make sure that big tasks
> don't get stuck on little cpus. IOW, a selective little to big
> migration based on task utilization.
>
>>
>> > little sense. Vendors build big.LITTLE systems because they want a
>> > better performance/energy trade-off, if they wanted max capacity always,
>> > they would just built big-only systems.
>>
>> And that's all the purpose of the EAS patchset. EAS patchset is there
>> to put some energy awareness in the scheduler decision. There is 2
>> running mode for EAS: one when there is spare cycles so tasks can be
>> placed to optimize energy consumption. And one when the system or part
>> of the system is overloaded and it goes back to default performance
>> mode because there is no interest for energy efficiency and we just
>> want to provide max performance. So the asym packing fits with this
>> latter mode as it provide the max compute capacity to the default mode
>> and doesn't break EAS as it uses the load balance which is disable by
>> EAS in not overloaded mode
>
> We still care about energy even when we are overutilized. We really
> don't want a vastly different placement policy depending on whether we
> are overutilized or not if we can avoid it as the situation changes
> frequently in many real world scenarios. With ASYM_PACKING everything
> could suddenly shift to big cpus if a little cpu is suddenly
> overutilized. With the misfit patches, we would detect exactly which
Not everything. The same happens with ASYM_PACKING. It doesn't blindly
put everything on "big" cores and does use parallelism too.
Regards,
Vincent
> little cpu that needs help, migrate the misfit task and everything will
> return to non-overutilized. That is why I said that ASYM_PACKING is
> incompatible with energy-aware scheduling and we would need the misfit
> patches anyway.
>
> Morten
On Mon, Apr 09, 2018 at 09:34:00AM +0200, Vincent Guittot wrote:
> Hi Morten,
>
> On 6 April 2018 at 14:58, Morten Rasmussen <[email protected]> wrote:
> > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> >> Hi Morten,
> >>
> >> On 5 April 2018 at 17:46, Morten Rasmussen <[email protected]> wrote:
> >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> >> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
>
> [snip]
>
> >> >> > What I meant was that if the task composition changes, IOW we mix "small"
> >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> >> >> > sysbench threads), we shouldn't assume all of those require to run on a big
> >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
> >> >>
> >> >> That's the 1st point where I tend to disagree: why big cores are only
> >> >> for long running task and periodic stuff can't need to run on big
> >> >> cores to get max compute capacity ?
> >> >> You make the assumption that only long running tasks need high compute
> >> >> capacity. This patch wants to always provide max compute capacity to
> >> >> the system and not only long running task
> >> >
> >> > There is no way we can tell if a periodic or short-running tasks
> >> > requires the compute capacity of a big core or not based on utilization
> >> > alone. The utilization can only tell us if a task could potentially use
> >> > more compute capacity, i.e. the utilization approaches the compute
> >> > capacity of its current cpu.
> >> >
> >> > How we handle low utilization tasks comes down to how we define
> >> > "performance" and if we care about the cost of "performance" (e.g.
> >> > energy consumption).
> >> >
> >> > Placing a low utilization task on a little cpu should always be fine
> >> > from _throughput_ point of view. As long as the cpu has spare cycles it
> >>
> >> I disagree, throughput is not only a matter of spare cycle it's also a
> >> matter of how fast you compute the work like with IO activity as an
> >> example
> >
> > From a cpu centric point of view it is, but I agree that from a
> > application/user point of view completion time might impact throughput
> > too. For example of if your throughput depends on how fast you can
> > offload work to some peripheral device (GPU for example).
> >
> > However, as I said in the beginning we don't know what the task does.
>
> I agree but that's not what you do with misfit as you assume long
> running task has higher priority but not shorter running tasks
Not really, as I said in the previous replies it comes down to what you see
as the goal of the CFS scheduler. With the misfit patches I'm just
trying to make sure that no task is overutilizing a cpu unnecessarily as
this is in line with what load-balancing does for SMP systems. Compute
capacity is distributed as evenly as possible based on utilization just
like it is for load-balancing when task priorities are the same. From
that point of view the misfit patches don't give long running tasks
preferential treatment. However, I do agree that from a completion time
point of view, low utilization tasks could suffer unnecessarily in some
scenarios.
I don't see optimizing for completion time of low utilization tasks as a
primary goal of CFS. Wake-up balancing does try to minimize wake-up
latency, but that is about it. Fork and exec balancing and the
load-balancing code are all based on load and utilization.
Even if we wanted to optimize for completion time it is more tricky for
asymmetric cpu capacity systems than it is for SMP. Just keeping the big
cpus busy all the time isn't going to do it for many scenarios.
Firstly, migrating running tasks is quite expensive so force-migrating a
short-running task could end up taking longer time than letting it
complete on a little cpu.
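The trade-off can be put into a back-of-the-envelope check. This is only
a sketch: it assumes execution time scales linearly with capacity, and
the migration cost is a hypothetical lump sum that would have to be
measured on a real system.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * remaining_us is the work left, expressed as run time on a cpu of
 * capacity 1024. migration_cost_us is a hypothetical lump cost (stopping
 * the task, moving it, refilling caches). Returns true if moving the
 * task to the big cpu finishes earlier than leaving it where it is.
 */
static bool migration_pays_off(unsigned long remaining_us,
			       unsigned long little_capacity,
			       unsigned long big_capacity,
			       unsigned long migration_cost_us)
{
	assert(little_capacity > 0 && big_capacity > 0);

	unsigned long stay_us = remaining_us * 1024UL / little_capacity;
	unsigned long move_us = migration_cost_us +
				remaining_us * 1024UL / big_capacity;

	return move_us < stay_us;
}
```

With capacities 512 and 1024 and an assumed cost of 500 us, a task with
100 us of work left is better off staying (200 us versus 600 us), while
a task with 10000 us left benefits from the move.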
Secondly, by keeping big cpus busy at all cost you risk that longer
running tasks will either end up queueing on the big cpus if you choose
to enqueue them there anyway, or they could end up running on a little
cpu if you go for the first available cpu in which case you end up
harming the completion time of that task instead. I'm not sure how you
balance which task's completion time is more important differently than
we do today based on load or utilization. The misfit patches use the
latter. We could let it use load instead, although I think we have agreed
in the past that comparing load to capacity isn't a great idea.
Finally, keeping big cpus busy will increase the number of active
migrations a lot.
As said above, I see your point that completion time might suffer in
some cases for low utilization tasks, but I don't see how you can fix
that automagically. ASYM_PACKING has a lot of problematic side-effects.
If user-space knows that completion time is important for a task, there
are already ways to improve that somewhat in mainline (task priority and
pinning), and more powerful solutions in the Android kernel which
Patrick is currently pushing upstream.
>
> >
> >> > means that work isn't piling up faster than it can be processed.
> >> > However, from a _latency_ (completion time) point of view it might be a
> >> > problem, and for latency sensitive tasks I can agree that going for max
> >> > capacity might be better choice.
> >> >
> >> > The misfit patches places tasks based on utilization to ensure that
> >> > tasks get the _throughput_ they need if possible. This is in line with
> >> > the placement policy we have in select_task_rq_fair() already.
> >> >
> >> > We shouldn't forget that what we are discussing here is the default
> >> > behaviour when we don't have sufficient knowledge about the tasks in the
> >> > scheduler. So we are looking a reasonable middle-of-the-road policy that
> >> > doesn't kill your performance or the battery. If user-space has its own
> >>
> >> But misfit task kills performance and might also kills your battery as
> >> it doesn't prevent small task to run on big cores
> >
> > As I said it is not perfect for all use-cases, it is middle-of-the-road
> > approach. But I strongly disagree that it is always a bad choice for
>
> mmh ... I never said that it's always a bad choice; I said that it can
> also easily make bad choice and kills performance and / or battery.
You did say "But misfit task kills performance and might...", but never
mind, thanks for clarifying your statement.
> In
> fact, we can't really predict the behavior of the system as short
> running tasks can be randomly put on big or little cores and random
> behavior are impossible to predict and mitigate.
You can't predict the behaviour of the system either if you use
ASYM_PACKING. The short running tasks may or may not be lucky to wake up
when there is a big cpu idle. Performance is a best-effort thing on most
modern systems. ASYM_PACKING might increase the probability that a short
running task ends up on a big cpu, but at the same time it might harm
predictability of completion time of long running tasks.
> > both energy and performance as you suggest. ASYM_PACKING doesn't
> > guarantee max "throughput" (by your definition) either as you may fill
> > up your big cores with smaller tasks leaving the big tasks behind on
> > little cpus.
>
> You didn't understand the point here. Asym ensures the max throughput
> to the system because it will provide the max compute capacity per
> seconds to the whole system and not only to some specific tasks. You
> assume that long running tasks must run on big cores and not short
> running tasks. But why filling a big core with long running task and
> filling a little core with short running tasks is the best choice ?
I'm fairly sure I understand your point. From a theoretical point of
view, if migrations were free and we had no caches, always keeping the
big cpus busy before using the little cpus would get us most throughput.
I don't disagree with that. The issue here is that migrations aren't
free, we do have caches, the CFS scheduler isn't designed to work that
way, and for many real world use-cases on big.LITTLE systems people
don't want to maximize global throughput, they want to maximize
throughput of the important tasks at the expense of everyone else
running slower even if they don't care about energy.
I'm not saying that scheduling short running tasks on little cpus is
always the best choice, but it seems to be a good compromise and it is
in line with the existing load-balancing policy. So I see it as the
least invasive solution to improve things for asymmetric cpu capacity
systems.
> Why should the opposite not be better as long as the big core is fully
> used? The goal is to keep the big CPUs busy whatever the type of tasks.
> Then, there are other mechanisms like cgroups to help sort groups of
> tasks.
Because of all the side-effects I mentioned further up. If your goal is
to keep the big cpus always busy, why not change the wake-up code to
always prefer them instead of trying to catch them later? That seems a
much more reasonable approach since you would migrate short running
tasks at wake-up which is much cheaper and would only require simple
tweaks to the existing capacity-aware wake-up code. Short running tasks
will always be handled there, so we only need to worry about long
running tasks that would be handled by the misfit patches. My worry with
doing that is that big tasks might suffer from additional migrations and
that the policy is too aggressive for users that care about energy, so
it would have to be disabled as soon as an energy model is in use.
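
A minimal sketch of the two wake-up policies contrasted above, assuming a
toy CPU model. The `Cpu` class, the capacity values and the 1.25 margin are
illustrative assumptions, not the kernel's wake_cap() code:

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    capacity: int  # compute capacity, 1024 = biggest CPU (hypothetical values)
    idle: bool


def pick_big_first(cpus):
    """Big-first wake-up: pick the idle CPU with the highest capacity."""
    idle = [c for c in cpus if c.idle]
    if not idle:
        return None
    # Highest capacity wins; the lowest cpu_id breaks ties.
    return max(idle, key=lambda c: (c.capacity, -c.cpu_id))


def pick_capacity_fit(cpus, task_util, margin=1.25):
    """Capacity-aware wake-up: pick the smallest idle CPU the task fits on.

    The margin is an assumed headroom factor, not a value from this thread.
    """
    fitting = [c for c in cpus if c.idle and c.capacity >= task_util * margin]
    if not fitting:
        return pick_big_first(cpus)
    return min(fitting, key=lambda c: (c.capacity, c.cpu_id))


# Two little CPUs and two big CPUs; one big CPU is already busy.
cpus = [Cpu(0, 446, True), Cpu(1, 446, True), Cpu(2, 1024, True), Cpu(3, 1024, False)]

print(pick_big_first(cpus).cpu_id)          # 2: a small task still lands on a big CPU
print(pick_capacity_fit(cpus, 100).cpu_id)  # 0: the small task stays on a little CPU
```

Both policies decide at wake-up, so neither needs a later migration for
short running tasks; they differ only in which idle CPU they prefer.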
> You try to partially do 2 things at the same time
I'm trying to make all the effort in scheduling and OSPM come together
while looking at what users need.
>
> >
> >> The default behavior of the scheduler is to provide max _throughput_
> >> not middle performance and then side activity can mitigate the power
> >> impact like frequency scaling or like EAS which tries to optimize the
> >> usage of energy when system is not overloaded.
> >
> > That view doesn't fit very well with all activities around integrating
> > cpufreq and the scheduler. Frequency scaling is an important factor in
> > optimizing the throughput.
> >
>
> Here you didn't catch my point either. Please don't attribute intentions
> to me that I don't have.
> By side activity, I'm not saying that it should not consolidate the
> cpufreq and other framework decisions. Scheduler is the best place to
> consolidate CPU related decisions. I'm just saying that it's an
> additional action taken to optimize energy.
> The scheduler doesn't use current frequency in task placement and load
> balancing as it assumes that max throughput is available if needed and
> adjusts frequency to current needs.
That is the whole problem with mainline scheduling and OSPM that we have
been working on addressing for several years now. Energy-aware
scheduling does exactly that, it considers current frequency as part of
task placement and we actively ask for a suitable frequency based on a
mix of PELT utilization and user-space hints. All this goodness has
already been in the Android kernel for years.
Hence my point above was to say that viewing frequency selection as a
"side activity" doesn't fit with what is being proposed for energy-aware
scheduling.
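
As a rough illustration of utilization-driven frequency requests, here is
a sketch loosely modeled on the headroom formula used by the mainline
schedutil governor (request about 1.25x the frequency the utilization
implies). The OPP table is hypothetical and user-space hints are left out:

```python
def next_freq_khz(util, max_capacity, max_freq_khz, available_khz):
    """Return the lowest available frequency covering util plus 25% headroom."""
    # (max_freq + max_freq / 4) * util / max_capacity, in integer arithmetic.
    target = (max_freq_khz + max_freq_khz // 4) * util // max_capacity
    for freq in sorted(available_khz):
        if freq >= target:
            return freq
    # Target is above the highest OPP: clamp to the maximum.
    return max(available_khz)


# Hypothetical OPP table, not taken from any real platform.
opps = [500_000, 1_000_000, 1_500_000, 2_000_000]

print(next_freq_khz(512, 1024, 2_000_000, opps))   # 1500000
print(next_freq_khz(1024, 1024, 2_000_000, opps))  # 2000000
```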
>
> >
> >> With misfit task, you
> >> make the assumption that short task on little core is the best
> >> placement to do even for a performance PoV.
> >
> > I never said it was the best placement, I said it was a reasonable
> > default policy for big.LITTLE systems.
>
> But "The primary job for the task scheduler is to deliver the highest
> possible throughput with minimal latency."
I'm not sure where that quote is coming from, but I think I have already
covered to a great extent above why optimizing aggressively for
keeping the big cpus busy on asymmetric cpu capacity systems isn't
necessarily the best choice. At least, if this is what we truly want,
ASYM_PACKING is not a good implementation of this policy.
>
> >
> >> It seems that you make
> >> some power/performance assumption without using an energy model which
> >> can make such decision. This is all the interest of EAS.
> >
> > I'm trying to see the bigger picture where you seem not to. The
>
> Thanks for helping me to get the bigger picture ;-)
>
> > ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric
> > view and the default policy I'm suggesting doesn't violate that view.
>
> Sorry I don't catch the sentences above
My point is that ASYM_PACKING conflicts with EAS while the misfit
patches work well with EAS and the resulting behaviour is in line with
load-balancing as I already covered above.
>
> > Your own code in group_is_overloaded() follows this view as it is
> > utilization based and happily accepts partially utilized groups as being
>
> But this is done for SMP system where all cores have same capacity and
> to detect when tasks can get more throughput on another CPU.
But you don't detect scenarios where you could improve completion time.
This is where this discussion started :-)
> ASYM_PACKING is there to add capacity awareness in the load balance
> when CPUs have different capacity
Well, one fundamental difference between asymmetric cpu capacity systems
(big.LITTLE) and the existing users of ASYM_PACKING is that the existing
users of ASYM_PACKING don't have any downsides of using that feature. As
in, the n+1th task to be packed doesn't get punished in terms of
performance just because it woke up later than the other tasks. It is
just placing tasks to improve the chances of an opportunistic
performance boost. This is not the case for asymmetric cpu capacity
systems. Using ASYM_PACKING here would mean that late wakers get
punished while early risers get treated with better throughput until
they choose to stop or get preempted because there are more tasks
than cpus.
Is it fair to favor the first tasks to wake? I think providing true fairness,
particularly on asymmetric cpu capacity systems, can only be achieved by
using a rotating scheduler, where each task takes turns running on the
fastest cpu ;-)
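
The early-riser effect can be put into numbers with a toy model. This is a
sketch under simplifying assumptions: one task per cpu, no migration after
placement, free rotation in the ideal case, and hypothetical capacities:

```python
def big_first_completion_times(work, capacities):
    """Place tasks in wake-up order on the fastest free cpu, one task per cpu.

    Returns each task's completion time (work divided by cpu capacity).
    Assumes there are at least as many cpus as tasks.
    """
    free = sorted(capacities, reverse=True)
    times = []
    for w in work:            # tasks are listed in wake-up order
        cap = free.pop(0)     # fastest cpu still free
        times.append(w / cap)
    return times


def rotating_completion_time(work_per_task, capacities):
    """Ideal rotation: every task sees the average capacity (migrations free)."""
    average_capacity = sum(capacities) / len(capacities)
    return work_per_task / average_capacity


# Two big cpus and two little cpus (hypothetical capacities).
capacities = [1024, 1024, 512, 512]

# Four identical tasks; only their wake-up order differs.
print(big_first_completion_times([1024] * 4, capacities))  # [1.0, 1.0, 2.0, 2.0]
print(rotating_completion_time(1024, capacities))
```

With identical tasks, the two late wakers take twice as long as the two
early risers, while ideal rotation gives every task the same completion
time, somewhere between the two.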
>
> > fine without need to be offloaded despite you could have multiple tasks
> > waiting to execute.
> > CFS doesn't provide any latency guarantees, but
> > we of course do the best we can within reason to minimize it.
> >
> > Seen in the bigger picture I would consider going for max capacity for
> > big.LITTLE systems more aggressive than using the performance cpufreq
> > governor. Nobody does the latter for battery powered devices, hence I
> > don't see why anyone would go big-always for big.LITTLE systems.
>
> And that's why EAS exists: to make battery friendly decision
True, I'm just wondering if we should spend effort supporting a use-case
which might only be of theoretical interest instead of focusing on the
problems that a lot of users care about.
> >> > opinion about performance requirements it is free to use task affinity
> >> > to control which cpu the task end up on and ensure that the task gets
> >> > max capacity always. On top of that we have had interfaces in Android
> >> > for years to specify performance requirements for task (groups) to allow
> >> > small tasks to be placed on big cpus and big task to be placed on little
> >> > cpus depending on their requirements. It is even tied into cpufreq as
> >> > well. A lot of effort has gone into Android to get this balance right.
> >> > Patrick is working hard on upstreaming some of those features.
> >> >
> >> > In the bigger picture always going for max capacity is not desirable for
> >> > well-configured big.LITTLE system. You would never exploit the advantage
> >> > of the little cpus as you always use big first and only use little when
> >> > the bigs are overloaded at which point having little cpus at all makes
> >>
> >> If I'm not wrong, the misfit task patchset doesn't prevent little tasks
> >> from running on big cores
> >
> > It does not, in fact it doesn't touch small tasks at all, that is not
> > the point of the patch set. The point is to make sure that big tasks
> > don't get stuck on little cpus. IOW, a selective little to big
> > migration based on task utilization.
> >
> >>
> >> > little sense. Vendors build big.LITTLE systems because they want a
> >> > better performance/energy trade-off, if they wanted max capacity always,
> >> > they would just build big-only systems.
> >>
> >> And that's the whole purpose of the EAS patchset. The EAS patchset is
> >> there to put some energy awareness in the scheduler decisions. There are
> >> 2 running modes for EAS: one when there are spare cycles so tasks can be
> >> placed to optimize energy consumption, and one when the system or part
> >> of the system is overloaded and it goes back to default performance
> >> mode because there is no interest in energy efficiency and we just
> >> want to provide max performance. So asym packing fits with this
> >> latter mode as it provides the max compute capacity to the default mode
> >> and doesn't break EAS as it uses the load balance, which is disabled by
> >> EAS in non-overloaded mode.
> >
> > We still care about energy even when we are overutilized. We really
> > don't want a vastly different placement policy depending on whether we
> > are overutilized or not if we can avoid it as the situation changes
> > frequently in many real world scenarios. With ASYM_PACKING everything
> > could suddenly shift to big cpus if a little cpu is suddenly
> > overutilized. With the misfit patches, we would detect exactly which
>
> Not everything. The same happens with ASYM_PACKING. It doesn't blindly
> put everything on "big" cores and does use parallelism too.
I fail to understand your point here. ASYM_PACKING doesn't put multiple
tasks on the same cpu, but it does fill all the big cpus even if all we
really need is to migrate a single big task.
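
To make the difference concrete, here is a simplified sketch of the two
migration policies. The task names, utilization values and the 1280/1024
margin are assumptions for illustration; neither function is the actual
misfit or ASYM_PACKING code:

```python
CAPACITY_MARGIN = 1280  # assumed ~20% headroom, expressed on a 1024 scale


def task_fits(util, cpu_capacity):
    """A task fits a cpu if its utilization leaves the assumed headroom."""
    return cpu_capacity * 1024 > util * CAPACITY_MARGIN


def misfit_migrations(tasks_on_little, little_capacity, idle_big_cpus):
    """Misfit-style: move only the tasks that no longer fit a little cpu."""
    misfits = [name for name, util in tasks_on_little
               if not task_fits(util, little_capacity)]
    return misfits[:idle_big_cpus]


def big_first_migrations(tasks_on_little, idle_big_cpus):
    """Big-first packing: fill every idle big cpu, whatever the task size."""
    return [name for name, _ in tasks_on_little][:idle_big_cpus]


# (name, utilization) of tasks currently on little cpus of capacity 446.
tasks = [("periodic_a", 50), ("sysbench", 430), ("periodic_b", 80)]

print(misfit_migrations(tasks, 446, idle_big_cpus=4))  # ['sysbench']
print(big_first_migrations(tasks, idle_big_cpus=4))    # all three tasks move
```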
Morten
On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> As said above, I see your point about completion time might suffer in
> some cases for low utilization tasks, but I don't see how you can fix
> that automagically. ASYM_PACKING has a lot of problematic side-effects.
> If user-space knows that completion time is important for a task, there
> are already ways to improve that somewhat in mainline (task priority and
> pinning), and more powerful solutions in the Android kernel which
> Patrick is currently pushing upstream.
So I tend to side with Morten on this one. I don't particularly like
ASYM_PACKING much, but we already had it for PPC and it works for the
small difference in performance ITMT has.
At the time Morten already objected to using it for ITMT, and I just
haven't had time to look into his proposal for using capacity.
But I don't see it working right for big.little/dynamiq, simply because
it is a very strong always-big preference, which is against the whole
design premise of big.little (as Morten has been trying to argue).
On Thu, Apr 12, 2018 at 08:22:11PM +0200, Peter Zijlstra wrote:
> But I don't see it working right for big.little/dynamiq, simply because
> it is a very strong always-big preference, which is against the whole
> design premise of big.little (as Morten has been trying to argue).
In Vincent's defence, vendors do sometimes make design decisions that I
don't quite understand. So there could be users that really want a
non-energy-aware big-first policy, but as I said earlier in this thread,
that could be implemented better with a small tweak to wake_cap() and
using the misfit patches.
We would have to disable the big-first policy and go with the current
migrate-big-task-to-big-cpus policy as soon as we care about energy. I'm
happy to give that a try and come up with a patch.
On 12 April 2018 at 20:22, Peter Zijlstra <[email protected]> wrote:
> But I don't see it working right for big.little/dynamiq, simply because
> it is a very strong always-big preference, which is against the whole
> design premise of big.little (as Morten has been trying to argue).
In fact, little cores not only give better power efficiency, they also
handle some work, interrupt handling for example, far better.
Nevertheless, whatever the solution, it will never fit
big.LITTLE/DynamIQ systems without some EAS as soon as power
efficiency is part of the equation.
I plan to test more deeply how ASYM_PACKING works with EAS
once I have finished other ongoing activities.
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussen
<[email protected]> wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen <[email protected]> wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> >> Hi Valentin,
>> >> >>
>> >> > [...]
>> >> >>>
>> >> >>> I believe ASYM_PACKING behaves better here because the workload is only
>> >> >>> sysbench threads. As stated above, since task utilization is disregarded, I
>> >> >>
>> >> >> It behaves better because it doesn't wait for the task's utilization
>> >> >> to reach a level before assuming the task needs high compute capacity.
>> >> >> The utilization gives an idea of the running time of the task not the
>> >> >> performance level that is needed
>> >> >>
>> >> >
>> >> > [
>> >> > That's my point actually. ASYM_PACKING disregards utilization and moves those
>> >> > threads to the big cores ASAP, which is good here because it's just sysbench
>> >> > threads.
>> >> >
>> >> > What I meant was that if the task composition changes, IOW we mix "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>> > [Morten]
>> >>
>> >> That's the 1st point where I tend to disagree: why are big cores only
>> >> for long running tasks, and why can't periodic stuff need to run on big
>> >> cores to get max compute capacity?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only to long running tasks
>> >
>> > There is no way we can tell if a periodic or short-running task
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> [Vincent]
>> I disagree, throughput is not only a matter of spare cycles, it's also a
>> matter of how fast you compute the work, with IO activity as an
>> example
>
> [Morten]
> From a cpu centric point of view it is, but I agree that from an
> application/user point of view completion time might impact throughput
> too. For example, if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.
[Joel]
Just wanted to comment on Vincent's point about IO load throughput,
remembering from when I was playing with the iowait boost stuff:
say you have a little task that does some IO and blocks, and does so
periodically. In this scenario the task will run for a short time and is
a little task judging by its utilization. However, if we were to
run it on the BIG CPUs, the overall throughput of the I/O activity
would be higher.
For this case, it seems it's impossible to specify the "default"
behavior correctly. Like, do we care about performance or energy more?
This seems more like a policy decision from userspace and not
something the scheduler should necessarily have to decide. For example,
the I/O activity may be background and not affect the user experience.
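
A back-of-the-envelope sketch of that scenario, assuming a task that
alternates a fixed amount of CPU work with a fixed I/O wait. All numbers
(capacities, 0.5 ms of work, 4.5 ms of I/O latency) are made up for
illustration:

```python
def io_loop_stats(cpu_ms_at_1024, cpu_capacity, io_latency_ms):
    """Return (requests per second, cpu utilization) for a compute+block loop.

    cpu_ms_at_1024 is the compute time per request on a capacity-1024 cpu;
    slower cpus take proportionally longer.
    """
    cpu_ms = cpu_ms_at_1024 * 1024 / cpu_capacity
    cycle_ms = cpu_ms + io_latency_ms
    return 1000.0 / cycle_ms, cpu_ms / cycle_ms


big_rate, big_util = io_loop_stats(0.5, 1024, 4.5)
little_rate, little_util = io_loop_stats(0.5, 512, 4.5)

print(f"big:    {big_rate:.1f} req/s at {big_util:.0%} utilization")
print(f"little: {little_rate:.1f} req/s at {little_util:.0%} utilization")
```

In this model the task looks small on either cpu, yet the big cpu still
completes more I/O requests per second, which is why utilization alone
cannot tell the scheduler which placement the user would prefer.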
thanks,
- Joel