2020-09-11 15:48:45

by Aubrey Li

Subject: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

Add an idle cpumask to track idle cpus in the sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask is set, and
when the CPU exits idle, its bit is cleared.

When a task wakes up and selects an idle cpu, scanning the idle
cpumask has a lower cost than scanning all the cpus in the last level
cache domain, especially when the system is heavily loaded.

Signed-off-by: Aubrey Li <[email protected]>
---
 include/linux/sched/topology.h | 13 +++++++++++++
 kernel/sched/fair.c            |  4 +++-
 kernel/sched/topology.c        |  2 +-
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..3b6f8a3589be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
+	cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
 	rcu_read_unlock();
 }
@@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
+	cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
 unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..92d0aeef86bf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
--
2.25.1


2020-09-11 16:30:14

by Qais Yousef

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 09/10/20 13:42, Aubrey Li wrote:
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <[email protected]>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];

Can't you use cpumask_var_t and zalloc_cpumask_var() instead?

The patch looks useful. Did it help you with any particular workload? It'd be
good to expand on that in the commit message.
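
Something like this, perhaps (untested sketch, only to illustrate the
idea, reusing the existing sds/j variables from __sdt_alloc()):

	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;
		int		has_idle_cores;
		cpumask_var_t	idle_cpus_span;	/* separately allocated mask */
	};

	/* in __sdt_alloc(), right after sds is allocated: */
	if (!zalloc_cpumask_var_node(&sds->idle_cpus_span,
				     GFP_KERNEL, cpu_to_node(j)))
		return -ENOMEM;

	/* plus a matching free_cpumask_var() in __sdt_free() */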

Thanks

--
Qais Yousef

2020-09-11 23:06:35

by Li, Aubrey

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 2020/9/12 0:28, Qais Yousef wrote:
> On 09/10/20 13:42, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Signed-off-by: Aubrey Li <[email protected]>
>> ---
>> include/linux/sched/topology.h | 13 +++++++++++++
>> kernel/sched/fair.c | 4 +++-
>> kernel/sched/topology.c | 2 +-
>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fb11091129b3..43a641d26154 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> + /*
>> + * Span of all idle CPUs in this domain.
>> + *
>> + * NOTE: this field is variable length. (Allocated dynamically
>> + * by attaching extra space to the end of the structure,
>> + * depending on how many CPUs the kernel has booted up with)
>> + */
>> + unsigned long idle_cpus_span[];
>
> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?

This way I can keep using the existing free code. Is there a problem with this?
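
(The mask is tail-allocated, so the existing free in __sdt_free()
already covers it; roughly:

	if (sdd->sds)
		kfree(*per_cpu_ptr(sdd->sds, j));

frees the struct and the attached cpumask in one go.)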

>
> The patch looks useful. Did it help you with any particular workload? It'd be
> good to expand on that in the commit message.
>
Odd, that was included in patch v1 0/1. Did you receive it?

Thanks,
-Aubrey

2020-09-11 23:19:03

by Li, Aubrey

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 2020/9/12 7:04, Li, Aubrey wrote:
> On 2020/9/12 0:28, Qais Yousef wrote:
>> On 09/10/20 13:42, Aubrey Li wrote:
>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>> and when the CPU exits idle, its bit will be cleared.
>>>
>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>> has low cost than scanning all the cpus in last level cache domain,
>>> especially when the system is heavily loaded.
>>>
>>> Signed-off-by: Aubrey Li <[email protected]>
>>> ---
>>> include/linux/sched/topology.h | 13 +++++++++++++
>>> kernel/sched/fair.c | 4 +++-
>>> kernel/sched/topology.c | 2 +-
>>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>> index fb11091129b3..43a641d26154 100644
>>> --- a/include/linux/sched/topology.h
>>> +++ b/include/linux/sched/topology.h
>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>> atomic_t ref;
>>> atomic_t nr_busy_cpus;
>>> int has_idle_cores;
>>> + /*
>>> + * Span of all idle CPUs in this domain.
>>> + *
>>> + * NOTE: this field is variable length. (Allocated dynamically
>>> + * by attaching extra space to the end of the structure,
>>> + * depending on how many CPUs the kernel has booted up with)
>>> + */
>>> + unsigned long idle_cpus_span[];
>>
>> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>
> I can use the existing free code. Do we have a problem of this?
>
>>
>> The patch looks useful. Did it help you with any particular workload? It'd be
>> good to expand on that in the commit message.
>>
> Odd, that included in patch v1 0/1, did you receive it?

I found it here:

https://lkml.org/lkml/2020/9/11/645

>
> Thanks,
> -Aubrey
>

2020-09-13 04:02:14

by Jiang Biao

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

Hi, Aubrey

On Fri, 11 Sep 2020 at 23:48, Aubrey Li <[email protected]> wrote:
>
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <[email protected]>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..3b6f8a3589be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
Is sds_idle_cpus() always empty if nohz=off?
Do we need to initialize idle_cpus_span with sched_domain_span(sd)?

>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> sd->nohz_idle = 0;
>
> atomic_inc(&sd->shared->nr_busy_cpus);
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> sd->nohz_idle = 1;
>
> atomic_dec(&sd->shared->nr_busy_cpus);
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
This only works when entering/exiting tickless mode? :)
Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?

Thx.
Regards,
Jiang

> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..92d0aeef86bf 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.25.1
>

2020-09-14 10:33:29

by Valentin Schneider

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain


On 12/09/20 00:04, Li, Aubrey wrote:
>>> +++ b/include/linux/sched/topology.h
>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>> atomic_t ref;
>>> atomic_t nr_busy_cpus;
>>> int has_idle_cores;
>>> + /*
>>> + * Span of all idle CPUs in this domain.
>>> + *
>>> + * NOTE: this field is variable length. (Allocated dynamically
>>> + * by attaching extra space to the end of the structure,
>>> + * depending on how many CPUs the kernel has booted up with)
>>> + */
>>> + unsigned long idle_cpus_span[];
>>
>> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>
> I can use the existing free code. Do we have a problem of this?
>

Nah, flexible array members are the preferred approach here; this also
means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
allocated.

See struct numa_group, struct sched_group, struct sched_domain, struct
em_perf_domain...
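
For reference, this is roughly what cpumask_var_t expands to in
include/linux/cpumask.h:

	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef struct cpumask *cpumask_var_t;	/* needs its own allocation */
	#else
	typedef struct cpumask cpumask_var_t[1];	/* embedded in the struct */
	#endif

so with OFFSTACK=y a cpumask_var_t member implies a second allocation.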

>>
>> The patch looks useful. Did it help you with any particular workload? It'd be
>> good to expand on that in the commit message.
>>
> Odd, that included in patch v1 0/1, did you receive it?
>
> Thanks,
> -Aubrey

2020-09-14 11:12:45

by Qais Yousef

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 09/14/20 11:31, Valentin Schneider wrote:
>
> On 12/09/20 00:04, Li, Aubrey wrote:
> >>> +++ b/include/linux/sched/topology.h
> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>> atomic_t ref;
> >>> atomic_t nr_busy_cpus;
> >>> int has_idle_cores;
> >>> + /*
> >>> + * Span of all idle CPUs in this domain.
> >>> + *
> >>> + * NOTE: this field is variable length. (Allocated dynamically
> >>> + * by attaching extra space to the end of the structure,
> >>> + * depending on how many CPUs the kernel has booted up with)
> >>> + */
> >>> + unsigned long idle_cpus_span[];
> >>
> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
> >
> > I can use the existing free code. Do we have a problem of this?
> >
>
> Nah, flexible array members are the preferred approach here; this also

Is this your opinion, or a rule written somewhere that I missed?

> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
> allocated.
>
> See struct numa_group, struct sched_group, struct sched_domain, struct
> em_perf_domain...

struct root_domain, struct cpupri_vec, struct generic_pm_domain,
struct irq_common_data...

all use cpumask_var_t.

Both approaches look correct to me, so no objection in principle. cpumask_var_t
looks neater IMO and becomes necessary once more than one cpumask is required
in a struct.
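
struct root_domain is one such case; only the last member of a struct
can be a flexible array, so multiple masks force cpumask_var_t
(trimmed from kernel/sched/sched.h):

	struct root_domain {
		...
		cpumask_var_t		span;
		cpumask_var_t		online;
		...
		cpumask_var_t		dlo_mask;
		...
		cpumask_var_t		rto_mask;
		...
	};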

>
> >>
> >> The patch looks useful. Did it help you with any particular workload? It'd be
> >> good to expand on that in the commit message.
> >>
> > Odd, that included in patch v1 0/1, did you receive it?

Aubrey,

Sorry, I didn't see that, no. It's important justification that belongs in the
commit message; I think it's worth adding.

Thanks

--
Qais Yousef

2020-09-14 11:31:34

by Valentin Schneider

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain


On 14/09/20 12:08, Qais Yousef wrote:
> On 09/14/20 11:31, Valentin Schneider wrote:
>>
>> On 12/09/20 00:04, Li, Aubrey wrote:
>> >>> +++ b/include/linux/sched/topology.h
>> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> >>> atomic_t ref;
>> >>> atomic_t nr_busy_cpus;
>> >>> int has_idle_cores;
>> >>> + /*
>> >>> + * Span of all idle CPUs in this domain.
>> >>> + *
>> >>> + * NOTE: this field is variable length. (Allocated dynamically
>> >>> + * by attaching extra space to the end of the structure,
>> >>> + * depending on how many CPUs the kernel has booted up with)
>> >>> + */
>> >>> + unsigned long idle_cpus_span[];
>> >>
>> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>> >
>> > I can use the existing free code. Do we have a problem of this?
>> >
>>
>> Nah, flexible array members are the preferred approach here; this also
>
> Is this your opinion or a rule written somewhere I missed?

I don't think there's a written rule, but AIUI it is preferred by at
least Peter:

https://lore.kernel.org/linux-pm/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

And my opinion is that, if you can, having fewer separate allocations is better.

>
>> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
>> allocated.
>>
>> See struct numa_group, struct sched_group, struct sched_domain, struct
>> em_perf_domain...
>
> struct root_domain, struct cpupri_vec, struct generic_pm_domain,
> struct irq_common_data..
>
> Use cpumask_var_t.
>
> Both approach look correct to me, so no objection in principle. cpumask_var_t
> looks neater IMO and will be necessary once more than one cpumask are required
> in a struct.
>

You're right that cpumask_var_t becomes necessary when you need more
than one mask. For those that use it despite requiring only one mask
(the cpupri stuff, struct nohz too), I'm not sure.

2020-09-14 11:34:36

by Qais Yousef

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 09/14/20 12:26, Valentin Schneider wrote:
>
> On 14/09/20 12:08, Qais Yousef wrote:
> > On 09/14/20 11:31, Valentin Schneider wrote:
> >>
> >> On 12/09/20 00:04, Li, Aubrey wrote:
> >> >>> +++ b/include/linux/sched/topology.h
> >> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >> >>> atomic_t ref;
> >> >>> atomic_t nr_busy_cpus;
> >> >>> int has_idle_cores;
> >> >>> + /*
> >> >>> + * Span of all idle CPUs in this domain.
> >> >>> + *
> >> >>> + * NOTE: this field is variable length. (Allocated dynamically
> >> >>> + * by attaching extra space to the end of the structure,
> >> >>> + * depending on how many CPUs the kernel has booted up with)
> >> >>> + */
> >> >>> + unsigned long idle_cpus_span[];
> >> >>
> >> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
> >> >
> >> > I can use the existing free code. Do we have a problem of this?
> >> >
> >>
> >> Nah, flexible array members are the preferred approach here; this also
> >
> > Is this your opinion or a rule written somewhere I missed?
>
> I don't think there's a written rule, but AIUI it is preferred by at
> least Peter:
>
> https://lore.kernel.org/linux-pm/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
> And my opinion is that, if you can, having fewer separate allocation is better.

+1

>
> >
> >> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
> >> allocated.
> >>
> >> See struct numa_group, struct sched_group, struct sched_domain, struct
> >> em_perf_domain...
> >
> > struct root_domain, struct cpupri_vec, struct generic_pm_domain,
> > struct irq_common_data..
> >
> > Use cpumask_var_t.
> >
> > Both approach look correct to me, so no objection in principle. cpumask_var_t
> > looks neater IMO and will be necessary once more than one cpumask are required
> > in a struct.
> >
>
> You're right in that cpumask_var_t becomes necessary when you need more
> than one mask. For those that use it despite requiring only one mask
> (cpupri stuff, struct nohz too), I'm not sure.

I don't have a strong opinion. cpumask_var_t is more readable and maintainable
IMO, but it's not a big deal. Either form can be easily changed.

Thanks

--
Qais Yousef

2020-09-14 12:36:25

by Vincent Guittot

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On Fri, 11 Sep 2020 at 15:28, Aubrey Li <[email protected]> wrote:
>
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <[email protected]>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..3b6f8a3589be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

sched_domain_shared is only set at the shared-cache level, so this works
only because select_idle_cpu() is called with sd_llc. It's worth adding
a comment to describe this.
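
e.g. something like (sketch only):

	/*
	 * sd->shared is only set at the LLC level, and select_idle_cpu()
	 * is always called with the LLC domain, so sds_idle_cpus() is
	 * valid here.
	 */
	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);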

>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> sd->nohz_idle = 0;
>
> atomic_inc(&sd->shared->nr_busy_cpus);
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> sd->nohz_idle = 1;
>
> atomic_dec(&sd->shared->nr_busy_cpus);
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..92d0aeef86bf 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.25.1
>

2020-09-14 17:27:05

by Vincent Guittot

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On Sun, 13 Sep 2020 at 05:59, Jiang Biao <[email protected]> wrote:
>
> Hi, Aubrey
>
> On Fri, 11 Sep 2020 at 23:48, Aubrey Li <[email protected]> wrote:
> >
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning idle cpumask
> > has low cost than scanning all the cpus in last level cache domain,
> > especially when the system is heavily loaded.
> >
> > Signed-off-by: Aubrey Li <[email protected]>
> > ---
> > include/linux/sched/topology.h | 13 +++++++++++++
> > kernel/sched/fair.c | 4 +++-
> > kernel/sched/topology.c | 2 +-
> > 3 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index fb11091129b3..43a641d26154 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > atomic_t ref;
> > atomic_t nr_busy_cpus;
> > int has_idle_cores;
> > + /*
> > + * Span of all idle CPUs in this domain.
> > + *
> > + * NOTE: this field is variable length. (Allocated dynamically
> > + * by attaching extra space to the end of the structure,
> > + * depending on how many CPUs the kernel has booted up with)
> > + */
> > + unsigned long idle_cpus_span[];
> > };
> >
> > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > +{
> > + return to_cpumask(sds->idle_cpus_span);
> > +}
> > +
> > struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b3b59cc51d6..3b6f8a3589be 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> > time = cpu_clock(this);
> >
> > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> Is the sds_idle_cpus() always empty if nohz=off?

Good point

> Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > sd->nohz_idle = 0;
> >
> > atomic_inc(&sd->shared->nr_busy_cpus);
> > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > sd->nohz_idle = 1;
> >
> > atomic_dec(&sd->shared->nr_busy_cpus);
> > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> This only works when entering/exiting tickless mode? :)
> Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?

set_cpu_sd_state_busy() is only called during a tick, in order to limit
the rate of the update to at most once per tick per cpu and to prevent
any kind of update storm if short-running tasks wake/sleep all the
time. We don't want to update a cpumask on each and every idle
enter/exit.
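
For reference, the approximate call path (all in the nohz code):

	scheduler_tick()
	  -> trigger_load_balance()
	    -> nohz_balancer_kick()
	      -> nohz_balance_exit_idle()
	        -> set_cpu_sd_state_busy()

and nohz_balance_exit_idle() bails out early unless the tick was
actually stopped, so the mask update is naturally rate-limited.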

>
> Thx.
> Regards,
> Jiang
>
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9079d865a935..92d0aeef86bf 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >
> > *per_cpu_ptr(sdd->sd, j) = sd;
> >
> > - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> > + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> > GFP_KERNEL, cpu_to_node(j));
> > if (!sds)
> > return -ENOMEM;
> > --
> > 2.25.1
> >

2020-09-15 08:48:37

by Jiang Biao

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

Hi, Vincent

On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
<[email protected]> wrote:
>
> On Sun, 13 Sep 2020 at 05:59, Jiang Biao <[email protected]> wrote:
> >
> > Hi, Aubrey
> >
> > On Fri, 11 Sep 2020 at 23:48, Aubrey Li <[email protected]> wrote:
> > >
> > > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > > enters idle, its corresponding bit in the idle cpumask will be set,
> > > and when the CPU exits idle, its bit will be cleared.
> > >
> > > When a task wakes up to select an idle cpu, scanning idle cpumask
> > > has low cost than scanning all the cpus in last level cache domain,
> > > especially when the system is heavily loaded.
> > >
> > > Signed-off-by: Aubrey Li <[email protected]>
> > > ---
> > > include/linux/sched/topology.h | 13 +++++++++++++
> > > kernel/sched/fair.c | 4 +++-
> > > kernel/sched/topology.c | 2 +-
> > > 3 files changed, 17 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index fb11091129b3..43a641d26154 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > > atomic_t ref;
> > > atomic_t nr_busy_cpus;
> > > int has_idle_cores;
> > > + /*
> > > + * Span of all idle CPUs in this domain.
> > > + *
> > > + * NOTE: this field is variable length. (Allocated dynamically
> > > + * by attaching extra space to the end of the structure,
> > > + * depending on how many CPUs the kernel has booted up with)
> > > + */
> > > + unsigned long idle_cpus_span[];
> > > };
> > >
> > > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > > +{
> > > + return to_cpumask(sds->idle_cpus_span);
> > > +}
> > > +
> > > struct sched_domain {
> > > /* These fields must be setup */
> > > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6b3b59cc51d6..3b6f8a3589be 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > >
> > > time = cpu_clock(this);
> > >
> > > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> > Is the sds_idle_cpus() always empty if nohz=off?
>
> Good point
>
> > Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
> >
> > >
> > > for_each_cpu_wrap(cpu, cpus, target) {
> > > if (!--nr)
> > > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > > sd->nohz_idle = 0;
> > >
> > > atomic_inc(&sd->shared->nr_busy_cpus);
> > > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > > unlock:
> > > rcu_read_unlock();
> > > }
> > > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > > sd->nohz_idle = 1;
> > >
> > > atomic_dec(&sd->shared->nr_busy_cpus);
> > > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > This only works when entering/exiting tickless mode? :)
> > Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
>
> set_cpu_sd_state_busy is only called during a tick in order to limit
> the rate of the update to once per tick per cpu at most and prevents
> any kind of storm of update if short running tasks wake/sleep all the
> time. We don't want to update a cpumask at each and every enter/leave
> idle.
>
Agreed. But set_cpu_sd_state_busy() does not seem to be reached when
nohz=off, which means this will not work for that case? :)

Thx.
Regards,
Jiang

2020-09-15 09:24:51

by Vincent Guittot

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On Tue, 15 Sep 2020 at 10:47, Jiang Biao <[email protected]> wrote:
>
> Hi, Vincent
>
> On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
> <[email protected]> wrote:
> >
> > On Sun, 13 Sep 2020 at 05:59, Jiang Biao <[email protected]> wrote:
> > >
> > > Hi, Aubrey
> > >
> > > On Fri, 11 Sep 2020 at 23:48, Aubrey Li <[email protected]> wrote:
> > > >
> > > > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > > > enters idle, its corresponding bit in the idle cpumask will be set,
> > > > and when the CPU exits idle, its bit will be cleared.
> > > >
> > > > When a task wakes up to select an idle cpu, scanning idle cpumask
> > > > has low cost than scanning all the cpus in last level cache domain,
> > > > especially when the system is heavily loaded.
> > > >
> > > > Signed-off-by: Aubrey Li <[email protected]>
> > > > ---
> > > > include/linux/sched/topology.h | 13 +++++++++++++
> > > > kernel/sched/fair.c | 4 +++-
> > > > kernel/sched/topology.c | 2 +-
> > > > 3 files changed, 17 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > > index fb11091129b3..43a641d26154 100644
> > > > --- a/include/linux/sched/topology.h
> > > > +++ b/include/linux/sched/topology.h
> > > > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > > > atomic_t ref;
> > > > atomic_t nr_busy_cpus;
> > > > int has_idle_cores;
> > > > + /*
> > > > + * Span of all idle CPUs in this domain.
> > > > + *
> > > > + * NOTE: this field is variable length. (Allocated dynamically
> > > > + * by attaching extra space to the end of the structure,
> > > > + * depending on how many CPUs the kernel has booted up with)
> > > > + */
> > > > + unsigned long idle_cpus_span[];
> > > > };
> > > >
> > > > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > > > +{
> > > > + return to_cpumask(sds->idle_cpus_span);
> > > > +}
> > > > +
> > > > struct sched_domain {
> > > > /* These fields must be setup */
> > > > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 6b3b59cc51d6..3b6f8a3589be 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >
> > > > time = cpu_clock(this);
> > > >
> > > > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> > > Is the sds_idle_cpus() always empty if nohz=off?
> >
> > Good point
> >
> > > Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
> > >
> > > >
> > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > if (!--nr)
> > > > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > > > sd->nohz_idle = 0;
> > > >
> > > > atomic_inc(&sd->shared->nr_busy_cpus);
> > > > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > > > unlock:
> > > > rcu_read_unlock();
> > > > }
> > > > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > > > sd->nohz_idle = 1;
> > > >
> > > > atomic_dec(&sd->shared->nr_busy_cpus);
> > > > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > > This only works when entering/exiting tickless mode? :)
> > > Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
> >
> > set_cpu_sd_state_busy is only called during a tick in order to limit
> > the rate of the update to once per tick per cpu at most and prevents
> > any kind of storm of update if short running tasks wake/sleep all the
> > time. We don't want to update a cpumask at each and every enter/leave
> > idle.
> >
> Agree. But set_cpu_sd_state_busy seems not being reached when
> nohz=off, which means it will not work for that case? :)

Yes, set_cpu_sd_state_idle()/busy() are nohz functions.
>
> Thx.
> Regards,
> Jiang

2020-09-16 00:34:58

by Li, Aubrey

Subject: Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain

On 2020/9/15 17:23, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 10:47, Jiang Biao <[email protected]> wrote:
>>
>> Hi, Vincent
>>
>> On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
>> <[email protected]> wrote:
>>>
>>> On Sun, 13 Sep 2020 at 05:59, Jiang Biao <[email protected]> wrote:
>>>>
>>>> Hi, Aubrey
>>>>
>>>> On Fri, 11 Sep 2020 at 23:48, Aubrey Li <[email protected]> wrote:
>>>>>
>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>
>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>> especially when the system is heavily loaded.
>>>>>
>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>> ---
>>>>> include/linux/sched/topology.h | 13 +++++++++++++
>>>>> kernel/sched/fair.c | 4 +++-
>>>>> kernel/sched/topology.c | 2 +-
>>>>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>>> index fb11091129b3..43a641d26154 100644
>>>>> --- a/include/linux/sched/topology.h
>>>>> +++ b/include/linux/sched/topology.h
>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>> atomic_t ref;
>>>>> atomic_t nr_busy_cpus;
>>>>> int has_idle_cores;
>>>>> + /*
>>>>> + * Span of all idle CPUs in this domain.
>>>>> + *
>>>>> + * NOTE: this field is variable length. (Allocated dynamically
>>>>> + * by attaching extra space to the end of the structure,
>>>>> + * depending on how many CPUs the kernel has booted up with)
>>>>> + */
>>>>> + unsigned long idle_cpus_span[];
>>>>> };
>>>>>
>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>>> +{
>>>>> + return to_cpumask(sds->idle_cpus_span);
>>>>> +}
>>>>> +
>>>>> struct sched_domain {
>>>>> /* These fields must be setup */
>>>>> struct sched_domain __rcu *parent; /* top domain must be null terminated */
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 6b3b59cc51d6..3b6f8a3589be 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>
>>>>> time = cpu_clock(this);
>>>>>
>>>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>> Is the sds_idle_cpus() always empty if nohz=off?
>>>
>>> Good point
>>>
>>>> Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>>>>
>>>>>
>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>> if (!--nr)
>>>>> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
>>>>> sd->nohz_idle = 0;
>>>>>
>>>>> atomic_inc(&sd->shared->nr_busy_cpus);
>>>>> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>>>>> unlock:
>>>>> rcu_read_unlock();
>>>>> }
>>>>> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
>>>>> sd->nohz_idle = 1;
>>>>>
>>>>> atomic_dec(&sd->shared->nr_busy_cpus);
>>>>> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> This only works when entering/exiting tickless mode? :)
>>>> Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
>>>
>>> set_cpu_sd_state_busy is only called during a tick in order to limit
>>> the rate of the update to once per tick per cpu at most and prevents
>>> any kind of storm of update if short running tasks wake/sleep all the
>>> time. We don't want to update a cpumask at each and every enter/leave
>>> idle.
>>>
>> Agree. But set_cpu_sd_state_busy seems not being reached when
>> nohz=off, which means it will not work for that case? :)
>
> Yes set_cpu_sd_state_idle/busy are nohz function

Thanks, Biao, for pointing this out.

If the shared idle cpumask is initialized with sched_domain_span(sd),
then the nohz=off case will keep the previous behavior.
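
Roughly (sketch; the copy would have to run after the domain spans are
set up in build_sched_domains(), not in __sdt_alloc() itself):

	cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));

With nohz=off the mask is then never cleared, so select_idle_cpu()
keeps scanning the full LLC span as before.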

Thanks,
-Aubrey