2012-10-04 06:02:06

by Viresh Kumar

Subject: Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

Hi Morten,

On 22 September 2012 00:02, <[email protected]> wrote:
> From: Morten Rasmussen <[email protected]>
>
> This patch introduces the basic SCHED_HMP infrastructure. Each class of
> cpus is represented by a hmp_domain and tasks will only be moved between
> these domains when their load profiles suggest it is beneficial.
>
> SCHED_HMP relies heavily on the task load-tracking introduced in Paul
> Turners fair group scheduling patch set:
>
> <https://lkml.org/lkml/2012/8/23/267>
>
> SCHED_HMP requires that the platform implements arch_get_hmp_domains()
> which should set up the platform specific list of hmp_domains. It is
> also assumed that the platform disables SD_LOAD_BALANCE for the
> appropriate sched_domains.

An explanation of this requirement would be helpful here.

> Tasks placement takes place every time a task is to be inserted into
> a runqueue based on its load history. The task placement decision is
> based on load thresholds.
>
> There are no restrictions on the number of hmp_domains, however,
> multiple (>2) has not been tested and the up/down migration policy is
> rather simple.
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> arch/arm/Kconfig | 17 +++++
> include/linux/sched.h | 6 ++
> kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 6 ++
> 4 files changed, 197 insertions(+)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index f4a5d58..5b09684 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1554,6 +1554,23 @@ config SCHED_SMT
> MultiThreading at a cost of slightly increased overhead in some
> places. If unsure say N here.
>
> +config DISABLE_CPU_SCHED_DOMAIN_BALANCE
> + bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
> + help
> + Disables scheduler load-balancing at CPU sched domain level.

Shouldn't this depend on EXPERIMENTAL?

> +config SCHED_HMP
> + bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"

ditto.

> + depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
> + help
> + Experimental scheduler optimizations for heterogeneous platforms.
> + Attempts to introspectively select task affinity to optimize power
> + and performance. Basic support for multiple (>2) cpu types is in place,
> + but it has only been tested with two types of cpus.
> + There is currently no support for migration of task groups, hence
> + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
> + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
> +
> config HAVE_ARM_SCU
> bool
> help
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 81e4e82..df971a3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
>
> bool cpus_share_cache(int this_cpu, int that_cpu);
>
> +#ifdef CONFIG_SCHED_HMP
> +struct hmp_domain {
> + struct cpumask cpus;
> + struct list_head hmp_domains;

Probably need a better name here. domain_list?

> +};
> +#endif /* CONFIG_SCHED_HMP */
> #else /* CONFIG_SMP */
>
> struct sched_domain_attr;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e17dd5..d80de46 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target)
> return target;
> }
>
> +#ifdef CONFIG_SCHED_HMP
> +/*
> + * Heterogenous multiprocessor (HMP) optimizations
> + *
> + * The cpu types are distinguished using a list of hmp_domains
> + * which each represent one cpu type using a cpumask.
> + * The list is assumed ordered by compute capacity with the
> + * fastest domain first.
> + */
> +DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> +
> +extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
> +
> +/* Setup hmp_domains */
> +static int __init hmp_cpu_mask_setup(void)

How should we interpret its return value? Can you mention what 0 & 1
mean here?

> +{
> + char buf[64];
> + struct hmp_domain *domain;
> + struct list_head *pos;
> + int dc, cpu;
> +
> + pr_debug("Initializing HMP scheduler:\n");
> +
> + /* Initialize hmp_domains using platform code */
> + arch_get_hmp_domains(&hmp_domains);
> + if (list_empty(&hmp_domains)) {
> + pr_debug("HMP domain list is empty!\n");
> + return 0;
> + }
> +
> + /* Print hmp_domains */
> + dc = 0;

Should be done during definition of dc.
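i.e.:

	int dc = 0, cpu;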

> + list_for_each(pos, &hmp_domains) {
> + domain = list_entry(pos, struct hmp_domain, hmp_domains);
> + cpulist_scnprintf(buf, 64, &domain->cpus);
> + pr_debug(" HMP domain %d: %s\n", dc, buf);

Spaces before HMP are intentional?

> +
> + for_each_cpu_mask(cpu, domain->cpus) {
> + per_cpu(hmp_cpu_domain, cpu) = domain;
> + }

Should use hmp_cpu_domain(cpu) here. Also, no need for {} around a
single-line loop.
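i.e. something like (untested):

	for_each_cpu_mask(cpu, domain->cpus)
		hmp_cpu_domain(cpu) = domain;

hmp_cpu_domain(cpu) expands to per_cpu(hmp_cpu_domain, cpu), so it is a
valid lvalue here.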

> + dc++;

You aren't using it... Only for testing? Should we remove it from the
mainline patchset and keep it locally?

> + }
> +
> + return 1;
> +}
> +
> +/*
> + * Migration thresholds should be in the range [0..1023]
> + * hmp_up_threshold: min. load required for migrating tasks to a faster cpu
> + * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
> + * The default values (512, 256) offer good responsiveness, but may need
> + * tweaking suit particular needs.
> + */
> +unsigned int hmp_up_threshold = 512;
> +unsigned int hmp_down_threshold = 256;

For default values, it is fine. But we should still let users provide
preferred values via DT or CONFIG_*.
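For example, a Kconfig knob could look like this (option names made up
here, just a sketch):

	config HMP_UP_THRESHOLD
		int "HMP up-migration load threshold"
		depends on SCHED_HMP
		range 0 1023
		default 512

	config HMP_DOWN_THRESHOLD
		int "HMP down-migration load threshold"
		depends on SCHED_HMP
		range 0 1023
		default 256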

> +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
> +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
> +
> +/* Check if cpu is in fastest hmp_domain */
> +static inline unsigned int hmp_cpu_is_fastest(int cpu)
> +{
> + struct list_head *pos;
> +
> + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> + return pos == hmp_domains.next;

better create list_is_first() for this.
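Something like this (untested), mirroring list_is_last() in
include/linux/list.h:

	static inline int list_is_first(const struct list_head *list,
					const struct list_head *head)
	{
		/* The first entry is the one whose predecessor is the head */
		return list->prev == head;
	}

hmp_cpu_is_fastest() would then simply return
list_is_first(pos, &hmp_domains).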

> +}
> +
> +/* Check if cpu is in slowest hmp_domain */
> +static inline unsigned int hmp_cpu_is_slowest(int cpu)
> +{
> + struct list_head *pos;
> +
> + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> + return list_is_last(pos, &hmp_domains);
> +}
> +
> +/* Next (slower) hmp_domain relative to cpu */
> +static inline struct hmp_domain *hmp_slower_domain(int cpu)
> +{
> + struct list_head *pos;
> +
> + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> + return list_entry(pos->next, struct hmp_domain, hmp_domains);
> +}
> +
> +/* Previous (faster) hmp_domain relative to cpu */
> +static inline struct hmp_domain *hmp_faster_domain(int cpu)
> +{
> + struct list_head *pos;
> +
> + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> + return list_entry(pos->prev, struct hmp_domain, hmp_domains);
> +}

For all four routines, the first two lines of the body can be merged. If you wish :)
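i.e.:

	struct list_head *pos = &hmp_cpu_domain(cpu)->hmp_domains;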

> +
> +/*
> + * Selects a cpu in previous (faster) hmp_domain
> + * Note that cpumask_any_and() returns the first cpu in the cpumask
> + */
> +static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
> + int cpu)
> +{
> + return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
> + tsk_cpus_allowed(tsk));
> +}
> +
> +/*
> + * Selects a cpu in next (slower) hmp_domain
> + * Note that cpumask_any_and() returns the first cpu in the cpumask
> + */
> +static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
> + int cpu)
> +{
> + return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
> + tsk_cpus_allowed(tsk));
> +}
> +
> +#endif /* CONFIG_SCHED_HMP */
> +
> /*
> * sched_balance_self: balance the current task (running on cpu) in domains
> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> @@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> unlock:
> rcu_read_unlock();
>
> +#ifdef CONFIG_SCHED_HMP
> + if (hmp_up_migration(prev_cpu, &p->se))
> + return hmp_select_faster_cpu(p, prev_cpu);
> + if (hmp_down_migration(prev_cpu, &p->se))
> + return hmp_select_slower_cpu(p, prev_cpu);
> + /* Make sure that the task stays in its previous hmp domain */
> + if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))

Why is this tested?

> + return prev_cpu;
> +#endif
> +
> return new_cpu;
> }
>
> @@ -5354,6 +5483,41 @@ need_kick:
> static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
> #endif
>
> +#ifdef CONFIG_SCHED_HMP
> +/* Check if task should migrate to a faster cpu */
> +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
> +{
> + struct task_struct *p = task_of(se);
> +
> + if (hmp_cpu_is_fastest(cpu))
> + return 0;
> +
> + if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
> + tsk_cpus_allowed(p))
> + && se->avg.load_avg_ratio > hmp_up_threshold) {
> + return 1;
> + }

I know all these comparisons are not very costly; still, I would prefer

se->avg.load_avg_ratio > hmp_up_threshold

as the first comparison in this routine.

We should first check whether the task needs migration at all, rather
than checking whether it can migrate to other cpus.
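Something like this (untested; note that hmp_cpu_is_fastest() still has
to be checked before hmp_faster_domain() is used):

	static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
	{
		struct task_struct *p = task_of(se);

		/* Cheap check first: is the task loaded enough to move up? */
		if (se->avg.load_avg_ratio <= hmp_up_threshold)
			return 0;

		/* Already in the fastest hmp_domain, nowhere to go */
		if (hmp_cpu_is_fastest(cpu))
			return 0;

		/* Only now check where the task is allowed to run */
		return cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
					  tsk_cpus_allowed(p));
	}

and the mirror image for hmp_down_migration().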

> + return 0;
> +}
> +
> +/* Check if task should migrate to a slower cpu */
> +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
> +{
> + struct task_struct *p = task_of(se);
> +
> + if (hmp_cpu_is_slowest(cpu))
> + return 0;
> +
> + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
> + tsk_cpus_allowed(p))
> + && se->avg.load_avg_ratio < hmp_down_threshold) {
> + return 1;
> + }

same here.

> + return 0;
> +}
> +
> +#endif /* CONFIG_SCHED_HMP */
> +
> /*
> * run_rebalance_domains is triggered when needed from the scheduler tick.
> * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
> @@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void)
> zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> cpu_notifier(sched_ilb_notifier, 0);
> #endif
> +
> +#ifdef CONFIG_SCHED_HMP
> + hmp_cpu_mask_setup();

Should we check the return value? If it's not required, should we
make the function return void?

> +#endif
> #endif /* SMP */
>
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 81135f9..4990d9e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
>
> extern int group_balance_cpu(struct sched_group *sg);
>
> +#ifdef CONFIG_SCHED_HMP
> +static LIST_HEAD(hmp_domains);
> +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))

You can drop the "()" around per_cpu().

Also, the per_cpu variable and the macro that accesses it have the same
name. Can we give them better names, or at least add an "_" before the
per_cpu pointer's name?
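e.g. (just one possible naming):

	DECLARE_PER_CPU(struct hmp_domain *, _hmp_cpu_domain);
	#define hmp_cpu_domain(cpu)	per_cpu(_hmp_cpu_domain, cpu)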

--
viresh


2012-10-04 06:54:37

by Amit Kucheria

Subject: Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

On Thu, Oct 4, 2012 at 11:32 AM, Viresh Kumar <[email protected]> wrote:
> Hi Morten,
>
> On 22 September 2012 00:02, <[email protected]> wrote:
>> From: Morten Rasmussen <[email protected]>
>>
>> This patch introduces the basic SCHED_HMP infrastructure. Each class of
>> cpus is represented by a hmp_domain and tasks will only be moved between
>> these domains when their load profiles suggest it is beneficial.
>>
>> SCHED_HMP relies heavily on the task load-tracking introduced in Paul
>> Turners fair group scheduling patch set:
>>
>> <https://lkml.org/lkml/2012/8/23/267>
>>
>> SCHED_HMP requires that the platform implements arch_get_hmp_domains()
>> which should set up the platform specific list of hmp_domains. It is
>> also assumed that the platform disables SD_LOAD_BALANCE for the
>> appropriate sched_domains.
>
> An explanation of this requirement would be helpful here.
>
>> Tasks placement takes place every time a task is to be inserted into
>> a runqueue based on its load history. The task placement decision is
>> based on load thresholds.
>>
>> There are no restrictions on the number of hmp_domains, however,
>> multiple (>2) has not been tested and the up/down migration policy is
>> rather simple.
>>
>> Signed-off-by: Morten Rasmussen <[email protected]>
>> ---
>> arch/arm/Kconfig | 17 +++++
>> include/linux/sched.h | 6 ++
>> kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 6 ++
>> 4 files changed, 197 insertions(+)
>>
>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>> index f4a5d58..5b09684 100644
>> --- a/arch/arm/Kconfig
>> +++ b/arch/arm/Kconfig
>> @@ -1554,6 +1554,23 @@ config SCHED_SMT
>> MultiThreading at a cost of slightly increased overhead in some
>> places. If unsure say N here.
>>
>> +config DISABLE_CPU_SCHED_DOMAIN_BALANCE
>> + bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
>> + help
>> + Disables scheduler load-balancing at CPU sched domain level.
>
> Shouldn't this depend on EXPERIMENTAL?

EXPERIMENTAL might be on its way out: https://lkml.org/lkml/2012/10/2/398

2012-10-09 15:56:23

by Morten Rasmussen

Subject: Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

Hi Viresh,

On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
> Hi Morten,
>
> On 22 September 2012 00:02, <[email protected]> wrote:
> > From: Morten Rasmussen <[email protected]>
> >
> > This patch introduces the basic SCHED_HMP infrastructure. Each class of
> > cpus is represented by a hmp_domain and tasks will only be moved between
> > these domains when their load profiles suggest it is beneficial.
> >
> > SCHED_HMP relies heavily on the task load-tracking introduced in Paul
> > Turners fair group scheduling patch set:
> >
> > <https://lkml.org/lkml/2012/8/23/267>
> >
> > SCHED_HMP requires that the platform implements arch_get_hmp_domains()
> > which should set up the platform specific list of hmp_domains. It is
> > also assumed that the platform disables SD_LOAD_BALANCE for the
> > appropriate sched_domains.
>
> An explanation of this requirement would be helpful here.
>

Yes. This is to prevent the load-balancer from moving tasks between
hmp_domains. This will be done exclusively by SCHED_HMP instead to
implement a strict task migration policy and avoid changing the
load-balancer behaviour. The load-balancer will take care of
load-balancing within each hmp_domain.
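As a rough sketch (the helper usage here is invented just to illustrate,
not the actual platform code in this series), the platform side boils
down to:

	/* For each sched_domain level that spans more than one
	 * hmp_domain, clear SD_LOAD_BALANCE so that only SCHED_HMP
	 * moves tasks between cpu types: */
	if (!cpumask_subset(sched_domain_span(sd),
			    &hmp_cpu_domain(cpumask_first(sched_domain_span(sd)))->cpus))
		sd->flags &= ~SD_LOAD_BALANCE;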

> > Tasks placement takes place every time a task is to be inserted into
> > a runqueue based on its load history. The task placement decision is
> > based on load thresholds.
> >
> > There are no restrictions on the number of hmp_domains, however,
> > multiple (>2) has not been tested and the up/down migration policy is
> > rather simple.
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> > ---
> > arch/arm/Kconfig | 17 +++++
> > include/linux/sched.h | 6 ++
> > kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 6 ++
> > 4 files changed, 197 insertions(+)
> >
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index f4a5d58..5b09684 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -1554,6 +1554,23 @@ config SCHED_SMT
> > MultiThreading at a cost of slightly increased overhead in some
> > places. If unsure say N here.
> >
> > +config DISABLE_CPU_SCHED_DOMAIN_BALANCE
> > + bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
> > + help
> > + Disables scheduler load-balancing at CPU sched domain level.
>
> Shouldn't this depend on EXPERIMENTAL?
>

It should. The ongoing discussion about CONFIG_EXPERIMENTAL that Amit is
referring to hasn't come to a conclusion yet.

> > +config SCHED_HMP
> > + bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
>
> ditto.
>
> > + depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
> > + help
> > + Experimental scheduler optimizations for heterogeneous platforms.
> > + Attempts to introspectively select task affinity to optimize power
> > + and performance. Basic support for multiple (>2) cpu types is in place,
> > + but it has only been tested with two types of cpus.
> > + There is currently no support for migration of task groups, hence
> > + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
> > + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
> > +
> > config HAVE_ARM_SCU
> > bool
> > help
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 81e4e82..df971a3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
> >
> > bool cpus_share_cache(int this_cpu, int that_cpu);
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +struct hmp_domain {
> > + struct cpumask cpus;
> > + struct list_head hmp_domains;
>
> Probably need a better name here. domain_list?
>

Yes. hmp_domain_list would be better and stick with the hmp_* naming
convention.

> > +};
> > +#endif /* CONFIG_SCHED_HMP */
> > #else /* CONFIG_SMP */
> >
> > struct sched_domain_attr;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e17dd5..d80de46 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target)
> > return target;
> > }
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +/*
> > + * Heterogenous multiprocessor (HMP) optimizations
> > + *
> > + * The cpu types are distinguished using a list of hmp_domains
> > + * which each represent one cpu type using a cpumask.
> > + * The list is assumed ordered by compute capacity with the
> > + * fastest domain first.
> > + */
> > +DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> > +
> > +extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
> > +
> > +/* Setup hmp_domains */
> > +static int __init hmp_cpu_mask_setup(void)
>
> How should we interpret its return value? Can you mention what 0 & 1
> mean here?
>

Returns 0 if domain setup failed, i.e. the domain list is empty, and 1
otherwise.

> > +{
> > + char buf[64];
> > + struct hmp_domain *domain;
> > + struct list_head *pos;
> > + int dc, cpu;
> > +
> > + pr_debug("Initializing HMP scheduler:\n");
> > +
> > + /* Initialize hmp_domains using platform code */
> > + arch_get_hmp_domains(&hmp_domains);
> > + if (list_empty(&hmp_domains)) {
> > + pr_debug("HMP domain list is empty!\n");
> > + return 0;
> > + }
> > +
> > + /* Print hmp_domains */
> > + dc = 0;
>
> Should be done during definition of dc.
>
> > + list_for_each(pos, &hmp_domains) {
> > + domain = list_entry(pos, struct hmp_domain, hmp_domains);
> > + cpulist_scnprintf(buf, 64, &domain->cpus);
> > + pr_debug(" HMP domain %d: %s\n", dc, buf);
>
> Spaces before HMP are intentional?
>

Yes. It makes the boot log easier to read when the hmp_domain listing is
indented.

> > +
> > + for_each_cpu_mask(cpu, domain->cpus) {
> > + per_cpu(hmp_cpu_domain, cpu) = domain;
> > + }
>
> Should use hmp_cpu_domain(cpu) here. Also, no need for {} around a
> single-line loop.
>
> > + dc++;
>
> You aren't using it... Only for testing? Should we remove it from the
> mainline patchset and keep it locally?
>

I'm using it in the pr_debug line a little earlier. It is used for
enumerating the hmp_domains.

> > + }
> > +
> > + return 1;
> > +}
> > +
> > +/*
> > + * Migration thresholds should be in the range [0..1023]
> > + * hmp_up_threshold: min. load required for migrating tasks to a faster cpu
> > + * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
> > + * The default values (512, 256) offer good responsiveness, but may need
> > + * tweaking suit particular needs.
> > + */
> > +unsigned int hmp_up_threshold = 512;
> > +unsigned int hmp_down_threshold = 256;
>
> For default values, it is fine. But we should still let users provide
> preferred values via DT or CONFIG_*.
>

Yes, but for now getting the scheduler to do the right thing has higher
priority than proper integration with DT.

> > +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
> > +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
> > +
> > +/* Check if cpu is in fastest hmp_domain */
> > +static inline unsigned int hmp_cpu_is_fastest(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return pos == hmp_domains.next;
>
> better create list_is_first() for this.
>

I had the same thought, but I see that as a separate patch that should
be submitted separately.

> > +}
> > +
> > +/* Check if cpu is in slowest hmp_domain */
> > +static inline unsigned int hmp_cpu_is_slowest(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_is_last(pos, &hmp_domains);
> > +}
> > +
> > +/* Next (slower) hmp_domain relative to cpu */
> > +static inline struct hmp_domain *hmp_slower_domain(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_entry(pos->next, struct hmp_domain, hmp_domains);
> > +}
> > +
> > +/* Previous (faster) hmp_domain relative to cpu */
> > +static inline struct hmp_domain *hmp_faster_domain(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_entry(pos->prev, struct hmp_domain, hmp_domains);
> > +}
>
> For all four routines, the first two lines of the body can be merged. If you wish :)
>

I have kept these helper functions fairly generic on purpose. It might
be necessary for multi-domain platforms (>2) to modify these functions
to implement better multi-domain task migration policies. I don't know
of any such platform, so for now these functions are very simple.

> > +
> > +/*
> > + * Selects a cpu in previous (faster) hmp_domain
> > + * Note that cpumask_any_and() returns the first cpu in the cpumask
> > + */
> > +static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
> > + int cpu)
> > +{
> > + return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
> > + tsk_cpus_allowed(tsk));
> > +}
> > +
> > +/*
> > + * Selects a cpu in next (slower) hmp_domain
> > + * Note that cpumask_any_and() returns the first cpu in the cpumask
> > + */
> > +static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
> > + int cpu)
> > +{
> > + return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
> > + tsk_cpus_allowed(tsk));
> > +}
> > +
> > +#endif /* CONFIG_SCHED_HMP */
> > +
> > /*
> > * sched_balance_self: balance the current task (running on cpu) in domains
> > * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> > @@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> > unlock:
> > rcu_read_unlock();
> >
> > +#ifdef CONFIG_SCHED_HMP
> > + if (hmp_up_migration(prev_cpu, &p->se))
> > + return hmp_select_faster_cpu(p, prev_cpu);
> > + if (hmp_down_migration(prev_cpu, &p->se))
> > + return hmp_select_slower_cpu(p, prev_cpu);
> > + /* Make sure that the task stays in its previous hmp domain */
> > + if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
>
> Why is this tested?
>

I don't think it is needed. It is there as an extra guarantee that
select_task_rq_fair() doesn't pick a cpu outside the task's current
hmp_domain in cases where there is no up or down migration. Disabling
SD_LOAD_BALANCE for the appropriate domains should give that guarantee.
I just haven't completely convinced myself yet.

> > + return prev_cpu;
> > +#endif
> > +
> > return new_cpu;
> > }
> >
> > @@ -5354,6 +5483,41 @@ need_kick:
> > static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
> > #endif
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +/* Check if task should migrate to a faster cpu */
> > +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
> > +{
> > + struct task_struct *p = task_of(se);
> > +
> > + if (hmp_cpu_is_fastest(cpu))
> > + return 0;
> > +
> > + if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
> > + tsk_cpus_allowed(p))
> > + && se->avg.load_avg_ratio > hmp_up_threshold) {
> > + return 1;
> > + }
>
> I know all these comparisons are not very costly; still, I would prefer
>
> se->avg.load_avg_ratio > hmp_up_threshold
>
> as the first comparison in this routine.
>
> We should first check whether the task needs migration at all, rather
> than checking whether it can migrate to other cpus.
>

Agree.

> > + return 0;
> > +}
> > +
> > +/* Check if task should migrate to a slower cpu */
> > +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
> > +{
> > + struct task_struct *p = task_of(se);
> > +
> > + if (hmp_cpu_is_slowest(cpu))
> > + return 0;
> > +
> > + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
> > + tsk_cpus_allowed(p))
> > + && se->avg.load_avg_ratio < hmp_down_threshold) {
> > + return 1;
> > + }
>
> same here.
>

Agree.

> > + return 0;
> > +}
> > +
> > +#endif /* CONFIG_SCHED_HMP */
> > +
> > /*
> > * run_rebalance_domains is triggered when needed from the scheduler tick.
> > * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
> > @@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void)
> > zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> > cpu_notifier(sched_ilb_notifier, 0);
> > #endif
> > +
> > +#ifdef CONFIG_SCHED_HMP
> > + hmp_cpu_mask_setup();
>
> Should we check the return value? If it's not required, should we
> make the function return void?
>

It can be changed to void if we don't add any error handling anyway.

> > +#endif
> > #endif /* SMP */
> >
> > }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 81135f9..4990d9e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
> >
> > extern int group_balance_cpu(struct sched_group *sg);
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +static LIST_HEAD(hmp_domains);
> > +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> > +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))
>
> You can drop the "()" around per_cpu().
>
> Also, the per_cpu variable and the macro that accesses it have the
> same name. Can we give them better names, or at least add an "_"
> before the per_cpu pointer's name?
>

Yes.

> --
> viresh
>

2012-10-09 16:58:24

by Viresh Kumar

Subject: Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

On 9 October 2012 21:26, Morten Rasmussen <[email protected]> wrote:
> On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
>> On 22 September 2012 00:02, <[email protected]> wrote:

>> > SCHED_HMP requires that the platform implements arch_get_hmp_domains()
>> > which should set up the platform specific list of hmp_domains. It is
>> > also assumed that the platform disables SD_LOAD_BALANCE for the
>> > appropriate sched_domains.
>>
>> An explanation of this requirement would be helpful here.
>
> Yes. This is to prevent the load-balancer from moving tasks between
> hmp_domains. This will be done exclusively by SCHED_HMP instead to
> implement a strict task migration policy and avoid changing the
> load-balancer behaviour. The load-balancer will take care of
> load-balancing within each hmp_domain.

Honestly speaking, I only understood this point now; earlier it wasn't
clear to me :)

What would be ideal is to put this information in a comment just before
we re-define the SCHED_* domains where we disable balancing, and to keep
it in the commit log too.
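e.g. a comment along these lines (wording is just a sketch):

	/*
	 * SD_LOAD_BALANCE is disabled on sched_domains spanning more
	 * than one hmp_domain. Tasks are moved between hmp_domains
	 * exclusively by SCHED_HMP; the regular load-balancer only
	 * balances within each hmp_domain.
	 */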

>> > +struct hmp_domain {
>> > + struct cpumask cpus;
>> > + struct list_head hmp_domains;
>>
>> Probably need a better name here. domain_list?
>
> Yes. hmp_domain_list would be better and stick with the hmp_* naming
> convention.

IMHO hmp_ would be better for global names, but names of variables
enclosed within another hmp_* data type don't actually need the hmp_
prefix, as it is implicit.

i.e.
struct hmp_domain {
	struct cpumask cpus;
	struct list_head domain_list;
};

would be better than

	struct list_head hmp_domain_list;

as the parent structure already has the hmp_ prefix. So whatever is
inside the struct is obviously HMP specific.

>> > +/* Setup hmp_domains */
>> > +static int __init hmp_cpu_mask_setup(void)
>>
>> How should we interpret its return value? Can you mention what 0 & 1
>> mean here?
>>
>
> Returns 0 if domain setup failed, i.e. the domain list is empty, and 1
> otherwise.

Helpful. Please mention this in the function comment in your next revision.
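e.g. something like:

	/*
	 * hmp_cpu_mask_setup - initialize the per-cpu hmp_cpu_domain
	 * pointers from the platform-provided hmp_domains list.
	 *
	 * Returns 0 if the hmp_domain list is empty (setup failed),
	 * 1 otherwise.
	 */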

>> > +{
>> > + char buf[64];
>> > + struct hmp_domain *domain;
>> > + struct list_head *pos;
>> > + int dc, cpu;

>> > + /* Print hmp_domains */
>> > + dc = 0;
>>
>> Should be done during definition of dc.

You missed this ??

>> > + for_each_cpu_mask(cpu, domain->cpus) {
>> > + per_cpu(hmp_cpu_domain, cpu) = domain;
>> > + }
>>
>> Should use hmp_cpu_domain(cpu) here. Also, no need for {} around a
>> single-line loop.

??

>> > + dc++;
>>
>> You aren't using it... Only for testing? Should we remove it from the
>> mainline patchset and keep it locally?
>>
>
> I'm using it in the pr_debug line a little earlier. It is used for
> enumerating the hmp_domains.

My mistake :(

>> > +/* Check if cpu is in fastest hmp_domain */
>> > +static inline unsigned int hmp_cpu_is_fastest(int cpu)
>> > +{
>> > + struct list_head *pos;
>> > +
>> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
>> > + return pos == hmp_domains.next;
>>
>> better create list_is_first() for this.
>
> I had the same thought, but I see that as a separate patch that should
> be submitted separately.

Correct. So better send it now, so that it is included before you send your
next version. :)

--
viresh