Add an idle cpumask to track idle CPUs in the sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask is set; when the
CPU exits idle, its bit is cleared.

When a task wakes up and selects an idle CPU, scanning the idle cpumask
has lower cost than scanning all the CPUs in the last-level-cache domain,
especially when the system is heavily loaded.
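For illustration, the bookkeeping described above can be sketched in plain
userspace C. Everything below (the 64-bit mask type, the llc_domain struct,
the function names) is hypothetical and only mirrors the shape of the kernel
change, not its actual APIs:

```c
/* Hypothetical userspace sketch: each last-level-cache domain keeps a
 * bitmask of idle CPUs.  CPUs set their bit on idle entry and clear it
 * on exit; the wakeup path then scans only (idle & affinity) instead
 * of every CPU in the domain span. */
#include <stdint.h>

typedef uint64_t cpumask_t;          /* up to 64 CPUs for this sketch */

struct llc_domain {
	cpumask_t span;              /* all CPUs sharing this LLC */
	cpumask_t idle;              /* subset currently idle */
};

static void cpu_enter_idle(struct llc_domain *d, int cpu)
{
	d->idle |= (cpumask_t)1 << cpu;
}

static void cpu_exit_idle(struct llc_domain *d, int cpu)
{
	d->idle &= ~((cpumask_t)1 << cpu);
}

/* Return the first candidate from the idle mask restricted to the
 * task's affinity, or -1 if none: only idle CPUs are ever visited. */
static int wakeup_scan(const struct llc_domain *d, cpumask_t affinity)
{
	cpumask_t cand = d->idle & affinity & d->span;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (cand & ((cpumask_t)1 << cpu))
			return cpu;
	return -1;
}
```

The heavily loaded case is where this pays off: the fewer idle CPUs there
are, the fewer bits the wakeup scan has to visit.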
The following benchmarks were run on an x86 4-socket system with
24 cores per socket and 2 hyperthreads per core, 192 CPUs in total:
uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads   baseline-avg   %std    patch-avg   %std
  96        1              1.24    0.98        2.76
  144       1              1.13    1.35        4.01
  192       1              0.58    1.67        3.25
  240       1              2.49    1.68        3.55

hackbench: process mode, 100000 loops, 40 file descriptors per group

  group     baseline-avg   %std    patch-avg   %std
  2(80)     1              12.05   0.97        9.88
  3(120)    1              12.48   0.95        11.62
  4(160)    1              13.83   0.97        13.22
  5(200)    1              2.76    1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread   baseline-avg   %std    patch-avg   %std
  6(96)     1              1.24    0.993       1.73
  9(144)    1              0.38    0.998       0.39
  12(192)   1              1.58    0.995       1.64
  15(240)   1              51.71   0.606       37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread    baseline-avg   %std    patch-avg   %std
  96        1              1.77    1.015       1.71
  144       1              3.39    0.998       4.05
  192       1              2.88    1.002       2.81
  240       1              2.07    1.011       2.09

kbuild: kexec reboot every time

  baseline-avg   patch-avg
  1              1
v1->v2:
- The idle cpumask is now updated in the nohz routines. By initializing
  the idle cpumask with sched_domain_span(sd), the nohz=off case keeps
  the original behavior.
Cc: Qais Yousef <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Jiang Biao <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Aubrey Li <[email protected]>
---
include/linux/sched/topology.h | 13 +++++++++++++
kernel/sched/fair.c | 9 ++++++++-
kernel/sched/topology.c | 3 ++-
3 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ /*
+ * Span of all idle CPUs in this domain.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus_span[];
};
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus_span);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..cfe78fcf69da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
time = cpu_clock(this);
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ /*
+ * sched_domain_shared is set only at shared cache level,
+ * this works only because select_idle_cpu is called with
+ * sd_llc.
+ */
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10187,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
atomic_inc(&sd->shared->nr_busy_cpus);
+ cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
rcu_read_unlock();
}
@@ -10212,6 +10218,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
atomic_dec(&sd->shared->nr_busy_cpus);
+ cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
}
sd->private = sdd;
@@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.25.1
On Wed, 16 Sep 2020 at 13:00, Mel Gorman <[email protected]> wrote:
>
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning idle cpumask
> > has low cost than scanning all the cpus in last level cache domain,
> > especially when the system is heavily loaded.
> >
> > The following benchmarks were tested on a x86 4 socket system with
> > 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
Yes, good point. I completely missed this.
>
> --
> Mel Gorman
> SUSE Labs
On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> The following benchmarks were tested on a x86 4 socket system with
> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>
This still appears to be tied to turning the tick off. An idle CPU
available for computation does not necessarily have the tick turned off
if it's for short periods of time. When nohz is disabled or a machine is
active enough that CPUs are not disabling the tick, select_idle_cpu may
fail to select an idle CPU and instead stack tasks on the old CPU.
The other subtlety is that select_idle_sibling() currently allows a
SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
idle as such, it's simply running a low priority task that is suitable
for preemption. I suspect this patch breaks that.
--
Mel Gorman
SUSE Labs
On 16/09/20 12:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on a x86 4 socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
Vincent pointed out in v1 that we ratelimit nohz_balance_exit_idle() by
having it happen on the tick, to prevent being hammered by a flurry of
sub-tick-granularity idle enters / exits. I'm afraid flipping bits of this
cpumask on every idle enter / exit might be too brutal.
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
I think you're spot on.
An alternative I see here would be to move this into its own
select_idle_foo() function. If that mask is empty, or none of the tagged
CPUs actually pass available_idle_cpu(), we fall through to the usual idle
searches.

That's far from perfect; you could wake a truly idle CPU instead of
preempting a SCHED_IDLE task on a warm and busy CPU. I'm not sure a
proliferation of cpumasks really is the answer to that...
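A rough userspace sketch of that fallthrough shape, with all names
hypothetical (in the kernel, the "really idle" check would be something
like available_idle_cpu() or sched_idle_cpu(); here it is modeled as a
second mask for simplicity):

```c
/* Hypothetical sketch of the select_idle_foo() fallthrough idea:
 * consult the tagged idle mask first, double-check each candidate
 * against what is really idle, and return -1 so the caller falls
 * through to the usual full idle searches. */
#include <stdint.h>

typedef uint64_t cpumask_t;          /* up to 64 CPUs for this sketch */

/* 'tagged' is what the shared idle mask claims; 'actually_idle' stands
 * in for a per-CPU available_idle_cpu()-style check. */
static int select_idle_from_mask(cpumask_t tagged, cpumask_t actually_idle,
				 cpumask_t affinity)
{
	cpumask_t cand = tagged & actually_idle & affinity;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if ((cand >> cpu) & 1)
			return cpu;
	return -1;   /* empty or stale mask: fall through to full scan */
}
```

The -1 return is the important part: a stale or empty mask degrades to the
existing behavior rather than stacking tasks on the old CPU.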
On 2020/9/16 19:00, Mel Gorman wrote:
> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> The following benchmarks were tested on a x86 4 socket system with
>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
>>
>
> This still appears to be tied to turning the tick off. An idle CPU
> available for computation does not necessarily have the tick turned off
> if it's for short periods of time. When nohz is disabled or a machine is
> active enough that CPUs are not disabling the tick, select_idle_cpu may
> fail to select an idle CPU and instead stack tasks on the old CPU.
>
> The other subtlety is that select_idle_sibling() currently allows a
> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> idle as such, it's simply running a low priority task that is suitable
> for preemption. I suspect this patch breaks that.
>
Thanks!
I shall post a v3 with performance data. I ran a quick uperf test and
found the benefit is still there, so I'm posting the patch here and looking
forward to your comments before I start the full benchmarks.
Thanks,
-Aubrey
-----------------------------------------------------------------------
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ /*
+ * Span of all idle CPUs in this domain.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus_span[];
};
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus_span);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..9a3c82645472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
}
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq)
+{
+ struct sched_domain *sd;
+ int cpu = cpu_of(rq);
+
+ rcu_read_lock();
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd || !sd->shared)
+ goto unlock;
+	if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+ goto unlock;
+ cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+unlock:
+ rcu_read_unlock();
+}
+
/*
* Scan the entire LLC domain for idle cores; this dynamically switches off if
* there are no idle cores left in the system; tracked through
@@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
time = cpu_clock(this);
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ /*
+ * sched_domain_shared is set only at shared cache level,
+ * this works only because select_idle_cpu is called with
+ * sd_llc.
+ */
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (want_affine)
current->recent_used_cpu = cpu;
+
+ sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
+ if (sd && sd->shared)
+ cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
}
rcu_read_unlock();
@@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+ /* Update idle cpumask if task has idle policy */
+ if (unlikely(task_has_idle_policy(p)))
+ update_idle_cpumask(rq);
}
void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1ae95b9150d3..876dfdfe35bb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
{
update_idle_core(rq);
+ update_idle_cpumask(rq);
schedstat_inc(rq->sched_goidle);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c82857e2e288..7a3355f61bcf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
#else
static inline void update_idle_core(struct rq *rq) { }
#endif
+void update_idle_cpumask(struct rq *rq);
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..f14a6ef4de57 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
}
sd->private = sdd;
@@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
On Thu, 17 Sep 2020 at 11:21, Li, Aubrey <[email protected]> wrote:
>
> On 2020/9/16 19:00, Mel Gorman wrote:
> > On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >> enters idle, its corresponding bit in the idle cpumask will be set,
> >> and when the CPU exits idle, its bit will be cleared.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has low cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> The following benchmarks were tested on a x86 4 socket system with
> >> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>
> >
> > This still appears to be tied to turning the tick off. An idle CPU
> > available for computation does not necessarily have the tick turned off
> > if it's for short periods of time. When nohz is disabled or a machine is
> > active enough that CPUs are not disabling the tick, select_idle_cpu may
> > fail to select an idle CPU and instead stack tasks on the old CPU.
> >
> > The other subtlety is that select_idle_sibling() currently allows a
> > SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> > idle as such, it's simply running a low priority task that is suitable
> > for preemption. I suspect this patch breaks that.
> >
> Thanks!
>
> I shall post a v3 with performance data, I made a quick uperf testing and
> found the benefit is still there. So I posted the patch here and looking
> forward to your comments before I start the benchmarks.
>
> Thanks,
> -Aubrey
>
> -----------------------------------------------------------------------
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..9a3c82645472 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> rcu_read_unlock();
> }
>
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + */
> +void update_idle_cpumask(struct rq *rq)
> +{
> + struct sched_domain *sd;
> + int cpu = cpu_of(rq);
> +
> + rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + if (!sd || !sd->shared)
> + goto unlock;
> + if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> + goto unlock;
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +unlock:
> + rcu_read_unlock();
> +}
> +
> /*
> * Scan the entire LLC domain for idle cores; this dynamically switches off if
> * there are no idle cores left in the system; tracked through
> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + /*
> + * sched_domain_shared is set only at shared cache level,
> + * this works only because select_idle_cpu is called with
> + * sd_llc.
> + */
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> if (want_affine)
> current->recent_used_cpu = cpu;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> + if (sd && sd->shared)
> + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
Why are you clearing the bit only in the fast path? The slow path can
also select an idle CPU.

Then, I'm afraid that updating a cpumask at each and every task wakeup
will be far too expensive. That's why we are not updating
nohz.idle_cpus_mask at each and every idle enter/exit but only once
per tick.
And a quick test with hackbench on my octo-core arm64 system gives, for
12 iterations of 'hackbench -l 2560 -g 1':

tip/sched/core  : 1.324 (+/- 1.26%)
with this patch : 2.419 (+/- 12.31%), a -82% regression
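For illustration, the once-per-tick ratelimiting mentioned above could be
sketched like this in userspace C (hypothetical names and timestamp
handling, not kernel code):

```c
/* Hypothetical sketch of rate-limiting shared-mask updates: instead of
 * flipping the mask bit on every idle enter/exit, publish a change at
 * most once per tick, mirroring how nohz.idle_cpus_mask is maintained. */
#include <stdint.h>

struct cpu_idle_state {
	uint64_t last_update;    /* tick timestamp of last mask write */
	int      masked_idle;    /* what the shared mask currently says */
};

/* Returns 1 if the shared mask should be written now, 0 if the update
 * is skipped (no change, or already updated during this tick). */
static int ratelimited_update(struct cpu_idle_state *s, uint64_t now,
			      int is_idle)
{
	if (s->masked_idle == is_idle)
		return 0;                /* nothing new to publish */
	if (now == s->last_update)
		return 0;                /* already updated this tick */
	s->last_update = now;
	s->masked_idle = is_idle;
	return 1;
}
```

Under a flurry of sub-tick idle enters/exits, only the first transition
per tick reaches the shared cacheline; the cost is that the mask can be
stale for up to one tick, which is the trade-off being debated here.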
> }
> rcu_read_unlock();
>
> @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> /* ensure bandwidth has been allocated on our new cfs_rq */
> account_cfs_rq_runtime(cfs_rq, 0);
> }
> + /* Update idle cpumask if task has idle policy */
> + if (unlikely(task_has_idle_policy(p)))
> + update_idle_cpumask(rq);
It's wrong because a sched_idle task will run from time to time even
when some cfs tasks are runnable.
> }
>
> void init_cfs_rq(struct cfs_rq *cfs_rq)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 1ae95b9150d3..876dfdfe35bb 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> {
> update_idle_core(rq);
> + update_idle_cpumask(rq);
> schedstat_inc(rq->sched_goidle);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c82857e2e288..7a3355f61bcf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> #else
> static inline void update_idle_core(struct rq *rq) { }
> #endif
> +void update_idle_cpumask(struct rq *rq);
>
> DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..f14a6ef4de57 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> }
>
> sd->private = sdd;
> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
>
On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey <[email protected]> wrote:
> >
> > On 2020/9/16 19:00, Mel Gorman wrote:
> > > On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> > >> Added idle cpumask to track idle cpus in sched domain. When a CPU
> > >> enters idle, its corresponding bit in the idle cpumask will be set,
> > >> and when the CPU exits idle, its bit will be cleared.
> > >>
> > >> When a task wakes up to select an idle cpu, scanning idle cpumask
> > >> has low cost than scanning all the cpus in last level cache domain,
> > >> especially when the system is heavily loaded.
> > >>
> > >> The following benchmarks were tested on a x86 4 socket system with
> > >> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> > >>
> > >
> > > This still appears to be tied to turning the tick off. An idle CPU
> > > available for computation does not necessarily have the tick turned off
> > > if it's for short periods of time. When nohz is disabled or a machine is
> > > active enough that CPUs are not disabling the tick, select_idle_cpu may
> > > fail to select an idle CPU and instead stack tasks on the old CPU.
> > >
> > > The other subtlety is that select_idle_sibling() currently allows a
> > > SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> > > idle as such, it's simply running a low priority task that is suitable
> > > for preemption. I suspect this patch breaks that.
> > >
> > Thanks!
> >
> > I shall post a v3 with performance data, I made a quick uperf testing and
> > found the benefit is still there. So I posted the patch here and looking
> > forward to your comments before I start the benchmarks.
> >
> > Thanks,
> > -Aubrey
> >
> > -----------------------------------------------------------------------
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index fb11091129b3..43a641d26154 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > atomic_t ref;
> > atomic_t nr_busy_cpus;
> > int has_idle_cores;
> > + /*
> > + * Span of all idle CPUs in this domain.
> > + *
> > + * NOTE: this field is variable length. (Allocated dynamically
> > + * by attaching extra space to the end of the structure,
> > + * depending on how many CPUs the kernel has booted up with)
> > + */
> > + unsigned long idle_cpus_span[];
> > };
> >
> > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > +{
> > + return to_cpumask(sds->idle_cpus_span);
> > +}
> > +
> > struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b3b59cc51d6..9a3c82645472 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> > rcu_read_unlock();
> > }
> >
> > +/*
> > + * Update cpu idle state and record this information
> > + * in sd_llc_shared->idle_cpus_span.
> > + */
> > +void update_idle_cpumask(struct rq *rq)
> > +{
> > + struct sched_domain *sd;
> > + int cpu = cpu_of(rq);
> > +
> > + rcu_read_lock();
> > + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > + if (!sd || !sd->shared)
> > + goto unlock;
> > + if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> > + goto unlock;
> > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > +unlock:
> > + rcu_read_unlock();
> > +}
> > +
> > /*
> > * Scan the entire LLC domain for idle cores; this dynamically switches off if
> > * there are no idle cores left in the system; tracked through
> > @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> > time = cpu_clock(this);
> >
> > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + /*
> > + * sched_domain_shared is set only at shared cache level,
> > + * this works only because select_idle_cpu is called with
> > + * sd_llc.
> > + */
> > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >
> > if (want_affine)
> > current->recent_used_cpu = cpu;
> > +
> > + sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> > + if (sd && sd->shared)
> > + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
>
> Why are you clearing the bit only for the fast path ? the slow path
> can also select an idle CPU
>
> Then, I'm afraid that updating a cpumask at each and every task wakeup
> will be far too expensive. That's why we are not updating
> nohz.idle_cpus_mask at each and every enter/exit idle but only once
> per tick.
>
> And a quick test with hackbench on my octo cores arm64 gives for 12
> iterations of: hackbench -l 2560 -g 1
> tip/sched/core : 1.324(+/- 1.26%)
> with this patch : 2.419(+/- 12.31%) -82% regression
>
> > }
> > rcu_read_unlock();
> >
> > @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> > /* ensure bandwidth has been allocated on our new cfs_rq */
> > account_cfs_rq_runtime(cfs_rq, 0);
> > }
> > + /* Update idle cpumask if task has idle policy */
> > + if (unlikely(task_has_idle_policy(p)))
> > + update_idle_cpumask(rq);
>
> it's wrong because a sched_idle task will run for time to time even
> when some cfs tasks are runnable
>
> > }
> >
> > void init_cfs_rq(struct cfs_rq *cfs_rq)
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 1ae95b9150d3..876dfdfe35bb 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> > static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> > {
> > update_idle_core(rq);
> > + update_idle_cpumask(rq);
> > schedstat_inc(rq->sched_goidle);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index c82857e2e288..7a3355f61bcf 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> > #else
> > static inline void update_idle_core(struct rq *rq) { }
> > #endif
> > +void update_idle_cpumask(struct rq *rq);
> >
> > DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9079d865a935..f14a6ef4de57 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> > atomic_inc(&sd->shared->ref);
> > atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> > + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> > }
> >
> > sd->private = sdd;
> > @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >
> > *per_cpu_ptr(sdd->sd, j) = sd;
> >
> > - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> > + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> > GFP_KERNEL, cpu_to_node(j));
> > if (!sds)
> > return -ENOMEM;
> >
On Tue, 22 Sep 2020 at 05:33, Li, Aubrey <[email protected]> wrote:
>
> On 2020/9/21 23:21, Vincent Guittot wrote:
> > On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> > <[email protected]> wrote:
> >>
> >> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey <[email protected]> wrote:
> >>>
> >>> On 2020/9/16 19:00, Mel Gorman wrote:
> >>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>> enters idle, its corresponding bit in the idle cpumask will be set,
> >>>>> and when the CPU exits idle, its bit will be cleared.
> >>>>>
> >>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>> has low cost than scanning all the cpus in last level cache domain,
> >>>>> especially when the system is heavily loaded.
> >>>>>
> >>>>> The following benchmarks were tested on a x86 4 socket system with
> >>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>>>>
> >>>>
> >>>> This still appears to be tied to turning the tick off. An idle CPU
> >>>> available for computation does not necessarily have the tick turned off
> >>>> if it's for short periods of time. When nohz is disabled or a machine is
> >>>> active enough that CPUs are not disabling the tick, select_idle_cpu may
> >>>> fail to select an idle CPU and instead stack tasks on the old CPU.
> >>>>
> >>>> The other subtlety is that select_idle_sibling() currently allows a
> >>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> >>>> idle as such, it's simply running a low priority task that is suitable
> >>>> for preemption. I suspect this patch breaks that.
> >>>>
> >>> Thanks!
> >>>
> >>> I shall post a v3 with performance data, I made a quick uperf testing and
> >>> found the benefit is still there. So I posted the patch here and looking
> >>> forward to your comments before I start the benchmarks.
> >>>
> >>> Thanks,
> >>> -Aubrey
> >>>
> >>> -----------------------------------------------------------------------
> >>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>> index fb11091129b3..43a641d26154 100644
> >>> --- a/include/linux/sched/topology.h
> >>> +++ b/include/linux/sched/topology.h
> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>> atomic_t ref;
> >>> atomic_t nr_busy_cpus;
> >>> int has_idle_cores;
> >>> + /*
> >>> + * Span of all idle CPUs in this domain.
> >>> + *
> >>> + * NOTE: this field is variable length. (Allocated dynamically
> >>> + * by attaching extra space to the end of the structure,
> >>> + * depending on how many CPUs the kernel has booted up with)
> >>> + */
> >>> + unsigned long idle_cpus_span[];
> >>> };
> >>>
> >>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >>> +{
> >>> + return to_cpumask(sds->idle_cpus_span);
> >>> +}
> >>> +
> >>> struct sched_domain {
> >>> /* These fields must be setup */
> >>> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 6b3b59cc51d6..9a3c82645472 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>> rcu_read_unlock();
> >>> }
> >>>
> >>> +/*
> >>> + * Update cpu idle state and record this information
> >>> + * in sd_llc_shared->idle_cpus_span.
> >>> + */
> >>> +void update_idle_cpumask(struct rq *rq)
> >>> +{
> >>> + struct sched_domain *sd;
> >>> + int cpu = cpu_of(rq);
> >>> +
> >>> + rcu_read_lock();
> >>> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>> + if (!sd || !sd->shared)
> >>> + goto unlock;
> >>> + if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>> + goto unlock;
> >>> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>> +unlock:
> >>> + rcu_read_unlock();
> >>> +}
> >>> +
> >>> /*
> >>> * Scan the entire LLC domain for idle cores; this dynamically switches off if
> >>> * there are no idle cores left in the system; tracked through
> >>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>>
> >>> time = cpu_clock(this);
> >>>
> >>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>> + /*
> >>> + * sched_domain_shared is set only at shared cache level,
> >>> + * this works only because select_idle_cpu is called with
> >>> + * sd_llc.
> >>> + */
> >>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>>
> >>> for_each_cpu_wrap(cpu, cpus, target) {
> >>> if (!--nr)
> >>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>>
> >>> if (want_affine)
> >>> current->recent_used_cpu = cpu;
> >>> +
> >>> + sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> >>> + if (sd && sd->shared)
> >>> + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
> >>
> >> Why are you clearing the bit only for the fast path ? the slow path
> >> can also select an idle CPU
>
> Right, I saw that idle core searching is turned off in the fast path only
> as well; since the next wakeup will check whether the CPU is idle, this
> only affects the idle cpu search span.
>
> >>
> >> Then, I'm afraid that updating a cpumask at each and every task wakeup
> >> will be far too expensive. That's why we are ot updating
> >
> > That's why we are not updating
>
> AFAIK, uperf/netperf are workloads with lots of short idle periods; do you
> have any other workloads in mind? I can measure them to verify this.
> >
> >> nohz.idle_cpus_mask at each and every enter/exit idle but only once
> >> per tick.
> Yes, agreed, this needs more thought, especially if the data is really
> bad.
>
> >>
> >> And a quick test with hackbench on my octo cores arm64 gives for 12
> >> iterations of: hackbench -l 2560 -g 1
> >> tip/sched/core : 1.324(+/- 1.26%)
> >> with this patch : 2.419(+/- 12.31%) -82% regression
>
> Can you please clarify this, is this running 2560 loops and 1 group?
yes, it's 2560 loops and 1 group, and I run the bench 12 times:
$ hackbench -l 2560 -g 1
Running in process mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 2560 messages of 100 bytes
Time: 2.953
you can also have a look at perf sched pipe
tip/sched/core
$ perf bench sched pipe -T -l 50000
# Running 'sched/pipe' benchmark:
# Executed 50000 pipe operations between two threads
Total time: 0.980 [sec]
19.609160 usecs/op
50996 ops/sec
With your patch :
$ perf bench sched pipe -T -l 50000
# Running 'sched/pipe' benchmark:
# Executed 50000 pipe operations between two threads
Total time: 1.283 [sec]
25.674200 usecs/op
38949 ops/sec
which is a 23% regression
> 10 iterations "./hackbench 1 process 2560" on my side are:
>
> 5.8.10: 0.14(+/- 12.01%)
> =========================
> [0.089, 0.148, 0.147, 0.141, 0.143, 0.143, 0.143, 0.146, 0.143, 0.142]
> Score: avg - 0.1385, std - 12.01%
>
> With this patch
> ================
> [0.095, 0.142, 0.143, 0.142, 0.15, 0.146, 0.144, 0.145, 0.143, 0.145]
> Score: avg - 0.1395, std - 10.88%
>
> I didn't see such big regression.
>
> >>
> >>> }
> >>> rcu_read_unlock();
> >>>
> >>> @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >>> /* ensure bandwidth has been allocated on our new cfs_rq */
> >>> account_cfs_rq_runtime(cfs_rq, 0);
> >>> }
> >>> + /* Update idle cpumask if task has idle policy */
> >>> + if (unlikely(task_has_idle_policy(p)))
> >>> + update_idle_cpumask(rq);
> >>
> >> it's wrong because a sched_idle task will run for time to time even
> >> when some cfs tasks are runnable
> >>
> Sorry I didn't get your point. The intention here is to add a SCHED_IDLE cpu to the idle cpumask,
> so that this cpu can be used as a target for wakeup preemption.
a cpu with sched_idle tasks can be considered idle iff only sched_idle
tasks are runnable. Look at sched_idle_cpu()
>
> >>> }
> >>>
> >>> void init_cfs_rq(struct cfs_rq *cfs_rq)
> >>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>> index 1ae95b9150d3..876dfdfe35bb 100644
> >>> --- a/kernel/sched/idle.c
> >>> +++ b/kernel/sched/idle.c
> >>> @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >>> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> >>> {
> >>> update_idle_core(rq);
> >>> + update_idle_cpumask(rq);
> >>> schedstat_inc(rq->sched_goidle);
> >>> }
> >>>
> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>> index c82857e2e288..7a3355f61bcf 100644
> >>> --- a/kernel/sched/sched.h
> >>> +++ b/kernel/sched/sched.h
> >>> @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> >>> #else
> >>> static inline void update_idle_core(struct rq *rq) { }
> >>> #endif
> >>> +void update_idle_cpumask(struct rq *rq);
> >>>
> >>> DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >>>
> >>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>> index 9079d865a935..f14a6ef4de57 100644
> >>> --- a/kernel/sched/topology.c
> >>> +++ b/kernel/sched/topology.c
> >>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>> atomic_inc(&sd->shared->ref);
> >>> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >>> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>> }
> >>>
> >>> sd->private = sdd;
> >>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>>
> >>> *per_cpu_ptr(sdd->sd, j) = sd;
> >>>
> >>> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >>> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>> GFP_KERNEL, cpu_to_node(j));
> >>> if (!sds)
> >>> return -ENOMEM;
> >>>
>
On Wed, 23 Sep 2020 at 04:59, Li, Aubrey <[email protected]> wrote:
>
> Hi Vincent,
>
> On 2020/9/22 15:14, Vincent Guittot wrote:
> > On Tue, 22 Sep 2020 at 05:33, Li, Aubrey <[email protected]> wrote:
> >>
> >> On 2020/9/21 23:21, Vincent Guittot wrote:
> >>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >>> <[email protected]> wrote:
> >>>>
> >>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey <[email protected]> wrote:
> >>>>>
> >>>>> On 2020/9/16 19:00, Mel Gorman wrote:
> >>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
> >>>>>>> and when the CPU exits idle, its bit will be cleared.
> >>>>>>>
> >>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>>> has lower cost than scanning all the cpus in the last level cache domain,
> >>>>>>> especially when the system is heavily loaded.
> >>>>>>>
> >>>>>>> The following benchmarks were tested on a x86 4 socket system with
> >>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>>>>>>
> >>>>>>
> >>>>>> This still appears to be tied to turning the tick off. An idle CPU
> >>>>>> available for computation does not necessarily have the tick turned off
> >>>>>> if it's for short periods of time. When nohz is disabled or a machine is
> >>>>>> active enough that CPUs are not disabling the tick, select_idle_cpu may
> >>>>>> fail to select an idle CPU and instead stack tasks on the old CPU.
> >>>>>>
> >>>>>> The other subtlety is that select_idle_sibling() currently allows a
> >>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> >>>>>> idle as such, it's simply running a low priority task that is suitable
> >>>>>> for preemption. I suspect this patch breaks that.
> >>>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> I shall post a v3 with performance data. I ran a quick uperf test and
> >>>>> found the benefit is still there, so I posted the patch here and am
> >>>>> looking forward to your comments before I start the benchmarks.
> >>>>>
> >>>>> Thanks,
> >>>>> -Aubrey
> >>>>>
> >>>>> -----------------------------------------------------------------------
> >>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>>>> index fb11091129b3..43a641d26154 100644
> >>>>> --- a/include/linux/sched/topology.h
> >>>>> +++ b/include/linux/sched/topology.h
> >>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>>>> atomic_t ref;
> >>>>> atomic_t nr_busy_cpus;
> >>>>> int has_idle_cores;
> >>>>> + /*
> >>>>> + * Span of all idle CPUs in this domain.
> >>>>> + *
> >>>>> + * NOTE: this field is variable length. (Allocated dynamically
> >>>>> + * by attaching extra space to the end of the structure,
> >>>>> + * depending on how many CPUs the kernel has booted up with)
> >>>>> + */
> >>>>> + unsigned long idle_cpus_span[];
> >>>>> };
> >>>>>
> >>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >>>>> +{
> >>>>> + return to_cpumask(sds->idle_cpus_span);
> >>>>> +}
> >>>>> +
> >>>>> struct sched_domain {
> >>>>> /* These fields must be setup */
> >>>>> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>> index 6b3b59cc51d6..9a3c82645472 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>>>> rcu_read_unlock();
> >>>>> }
> >>>>>
> >>>>> +/*
> >>>>> + * Update cpu idle state and record this information
> >>>>> + * in sd_llc_shared->idle_cpus_span.
> >>>>> + */
> >>>>> +void update_idle_cpumask(struct rq *rq)
> >>>>> +{
> >>>>> + struct sched_domain *sd;
> >>>>> + int cpu = cpu_of(rq);
> >>>>> +
> >>>>> + rcu_read_lock();
> >>>>> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>>>> + if (!sd || !sd->shared)
> >>>>> + goto unlock;
> >>>>> + if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>>>> + goto unlock;
> >>>>> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>>>> +unlock:
> >>>>> + rcu_read_unlock();
> >>>>> +}
> >>>>> +
> >>>>> /*
> >>>>> * Scan the entire LLC domain for idle cores; this dynamically switches off if
> >>>>> * there are no idle cores left in the system; tracked through
> >>>>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>>>>
> >>>>> time = cpu_clock(this);
> >>>>>
> >>>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>>> + /*
> >>>>> + * sched_domain_shared is set only at shared cache level,
> >>>>> + * this works only because select_idle_cpu is called with
> >>>>> + * sd_llc.
> >>>>> + */
> >>>>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>>>>
> >>>>> for_each_cpu_wrap(cpu, cpus, target) {
> >>>>> if (!--nr)
> >>>>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>>>>
> >>>>> if (want_affine)
> >>>>> current->recent_used_cpu = cpu;
> >>>>> +
> >>>>> + sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> >>>>> + if (sd && sd->shared)
> >>>>> + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
> >>>>
> >>>> Why are you clearing the bit only for the fast path ? the slow path
> >>>> can also select an idle CPU
> >>
> >> Right, I saw that idle core searching is turned off in the fast path only
> >> as well; since the next wakeup will check whether the CPU is idle, this
> >> only affects the idle cpu search span.
> >>
> >>>>
> >>>> Then, I'm afraid that updating a cpumask at each and every task wakeup
> >>>> will be far too expensive. That's why we are ot updating
> >>>
> >>> That's why we are not updating
> >>
> >> AFAIK, uperf/netperf are workloads with lots of short idle periods; do you
> >> have any other workloads in mind? I can measure them to verify this.
> >>>
> >>>> nohz.idle_cpus_mask at each and every enter/exit idle but only once
> >>>> per tick.
> >> Yes, agreed, this needs more thought, especially if the data is really
> >> bad.
> >>
> >>>>
> >>>> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>>> iterations of: hackbench -l 2560 -g 1
> >>>> tip/sched/core : 1.324(+/- 1.26%)
> >>>> with this patch : 2.419(+/- 12.31%) -82% regression
> >>
> >> Can you please clarify this, is this running 2560 loops and 1 group?
> >
> > yes, it's 2560 loops and 1 group, and I run the bench 12 times:
> > $ hackbench -l 2560 -g 1
> > Running in process mode with 1 groups using 40 file descriptors each
> > (== 40 tasks)
> > Each sender will pass 2560 messages of 100 bytes
> > Time: 2.953
> >
> > you can also have a look at perf sched pipe
> > tip/sched/core
> > $ perf bench sched pipe -T -l 50000
> > # Running 'sched/pipe' benchmark:
> > # Executed 50000 pipe operations between two threads
> >
> > Total time: 0.980 [sec]
> >
> > 19.609160 usecs/op
> > 50996 ops/sec
> >
> > With your patch :
> > $ perf bench sched pipe -T -l 50000
> > # Running 'sched/pipe' benchmark:
> > # Executed 50000 pipe operations between two threads
> >
> > Total time: 1.283 [sec]
> >
> > 25.674200 usecs/op
> > 38949 ops/sec
> >
> > which is a 23% regression
>
> This is interesting, 10 iterations data on my side:
>
> $ cat upstream.log | grep Total
> Total time: 0.799 [sec]
> Total time: 0.513 [sec]
> Total time: 0.658 [sec]
> Total time: 0.499 [sec]
> Total time: 0.644 [sec]
> Total time: 0.758 [sec]
> Total time: 0.576 [sec]
> Total time: 0.864 [sec]
> Total time: 0.566 [sec]
> Total time: 0.694 [sec]
> Total time: 0.609 [sec]
>
> $ cat idle_cpumask.log | grep Total
> Total time: 0.595 [sec]
> Total time: 0.519 [sec]
> Total time: 0.592 [sec]
> Total time: 0.504 [sec]
> Total time: 0.440 [sec]
> Total time: 0.676 [sec]
> Total time: 0.683 [sec]
> Total time: 0.437 [sec]
> Total time: 0.460 [sec]
> Total time: 0.473 [sec]
> Total time: 0.478 [sec]
>
> That is, there is a 18.53% improvement.
> ========================================
> cases avg std
> 5.8.10 0.653 17.958
> idle_cpumask 0.532 16.916
>
> Is this caused by the architecture specific atomic operations?
> I also noticed I have shorter total time of both hackbench and perf sched
> pipe.
TBH, I'm surprised that perf sched pipe is shorter with your patch. It
adds more code to run but doesn't take advantage of a shorter cpumask in
select_idle_cpu(), because it either skips select_idle_cpu() when the
previous cpu is idle, which should be the case in this lightly loaded
benchmark, or the cpumask should be almost the same as the sched_domain
span
The sequence is :
select_task_rq_fair()
select prev which should be idle
+cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
pick_next_task_fair()
set_next_task_fair()
+unlikely update_idle_cpumask(rq);
wake up other side
go back to sleep
put_prev_task_fair
pick_next_task_idle
set_next_task_idle
+update_idle_cpumask(rq)
>
> Would you mind share uperf(netperf load) result on your side? That's the
> workload I have seen the most benefit this patch contributed under heavy
> load level.
with uperf, i've got the same kind of result as sched pipe
tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
regression as for sched pipe
>
> Maybe I should look at the architecture-specific implementation? Any clue?
>
> >
> >> 10 iterations "./hackbench 1 process 2560" on my side are:
> >>
> >> 5.8.10: 0.14(+/- 12.01%)
> >> =========================
> >> [0.089, 0.148, 0.147, 0.141, 0.143, 0.143, 0.143, 0.146, 0.143, 0.142]
> >> Score: avg - 0.1385, std - 12.01%
> >>
> >> With this patch
> >> ================
> >> [0.095, 0.142, 0.143, 0.142, 0.15, 0.146, 0.144, 0.145, 0.143, 0.145]
> >> Score: avg - 0.1395, std - 10.88%
> >>
> >> I didn't see such big regression.
> >>
> >>>>
> >>>>> }
> >>>>> rcu_read_unlock();
> >>>>>
> >>>>> @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >>>>> /* ensure bandwidth has been allocated on our new cfs_rq */
> >>>>> account_cfs_rq_runtime(cfs_rq, 0);
> >>>>> }
> >>>>> + /* Update idle cpumask if task has idle policy */
> >>>>> + if (unlikely(task_has_idle_policy(p)))
> >>>>> + update_idle_cpumask(rq);
> >>>>
> >>>> it's wrong because a sched_idle task will run for time to time even
> >>>> when some cfs tasks are runnable
> >>>>
> >> Sorry I didn't get your point. The intention here is to add a SCHED_IDLE cpu to the idle cpumask,
> >> so that this cpu can be used as a target for wakeup preemption.
> >
> > a cpu with sched_idle tasks can be considered idle iff only sched_idle
> > tasks are runnable. Look at sched_idle_cpu()
> >
> You are right, thanks for pointing this out.
>
> Thanks,
> -Aubrey
> >>
> >>>>> }
> >>>>>
> >>>>> void init_cfs_rq(struct cfs_rq *cfs_rq)
> >>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>>>> index 1ae95b9150d3..876dfdfe35bb 100644
> >>>>> --- a/kernel/sched/idle.c
> >>>>> +++ b/kernel/sched/idle.c
> >>>>> @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >>>>> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> >>>>> {
> >>>>> update_idle_core(rq);
> >>>>> + update_idle_cpumask(rq);
> >>>>> schedstat_inc(rq->sched_goidle);
> >>>>> }
> >>>>>
> >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>>> index c82857e2e288..7a3355f61bcf 100644
> >>>>> --- a/kernel/sched/sched.h
> >>>>> +++ b/kernel/sched/sched.h
> >>>>> @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> >>>>> #else
> >>>>> static inline void update_idle_core(struct rq *rq) { }
> >>>>> #endif
> >>>>> +void update_idle_cpumask(struct rq *rq);
> >>>>>
> >>>>> DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >>>>>
> >>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>>>> index 9079d865a935..f14a6ef4de57 100644
> >>>>> --- a/kernel/sched/topology.c
> >>>>> +++ b/kernel/sched/topology.c
> >>>>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>>>> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>>>> atomic_inc(&sd->shared->ref);
> >>>>> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >>>>> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>>>> }
> >>>>>
> >>>>> sd->private = sdd;
> >>>>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>>>>
> >>>>> *per_cpu_ptr(sdd->sd, j) = sd;
> >>>>>
> >>>>> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >>>>> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>>>> GFP_KERNEL, cpu_to_node(j));
> >>>>> if (!sds)
> >>>>> return -ENOMEM;
> >>>>>
> >>
>
On Thu, 24 Sep 2020 at 05:04, Li, Aubrey <[email protected]> wrote:
>
> On 2020/9/23 16:50, Vincent Guittot wrote:
> > On Wed, 23 Sep 2020 at 04:59, Li, Aubrey <[email protected]> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 2020/9/22 15:14, Vincent Guittot wrote:
> >>> On Tue, 22 Sep 2020 at 05:33, Li, Aubrey <[email protected]> wrote:
> >>>>
> >>>> On 2020/9/21 23:21, Vincent Guittot wrote:
> >>>>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 2020/9/16 19:00, Mel Gorman wrote:
> >>>>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >>>>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
> >>>>>>>>> and when the CPU exits idle, its bit will be cleared.
> >>>>>>>>>
> >>>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>>>>> has lower cost than scanning all the cpus in the last level cache domain,
> >>>>>>>>> especially when the system is heavily loaded.
> >>>>>>>>>
> >>>>>>>>> The following benchmarks were tested on a x86 4 socket system with
> >>>>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> This still appears to be tied to turning the tick off. An idle CPU
> >>>>>>>> available for computation does not necessarily have the tick turned off
> >>>>>>>> if it's for short periods of time. When nohz is disabled or a machine is
> >>>>>>>> active enough that CPUs are not disabling the tick, select_idle_cpu may
> >>>>>>>> fail to select an idle CPU and instead stack tasks on the old CPU.
> >>>>>>>>
> >>>>>>>> The other subtlety is that select_idle_sibling() currently allows a
> >>>>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> >>>>>>>> idle as such, it's simply running a low priority task that is suitable
> >>>>>>>> for preemption. I suspect this patch breaks that.
> >>>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> I shall post a v3 with performance data. I ran a quick uperf test and
> >>>>>>> found the benefit is still there, so I posted the patch here and am
> >>>>>>> looking forward to your comments before I start the benchmarks.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> -Aubrey
> >>>>>>>
> >>>>>>> -----------------------------------------------------------------------
> >>>>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>>>>>> index fb11091129b3..43a641d26154 100644
> >>>>>>> --- a/include/linux/sched/topology.h
> >>>>>>> +++ b/include/linux/sched/topology.h
> >>>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>>>>>> atomic_t ref;
> >>>>>>> atomic_t nr_busy_cpus;
> >>>>>>> int has_idle_cores;
> >>>>>>> + /*
> >>>>>>> + * Span of all idle CPUs in this domain.
> >>>>>>> + *
> >>>>>>> + * NOTE: this field is variable length. (Allocated dynamically
> >>>>>>> + * by attaching extra space to the end of the structure,
> >>>>>>> + * depending on how many CPUs the kernel has booted up with)
> >>>>>>> + */
> >>>>>>> + unsigned long idle_cpus_span[];
> >>>>>>> };
> >>>>>>>
> >>>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >>>>>>> +{
> >>>>>>> + return to_cpumask(sds->idle_cpus_span);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> struct sched_domain {
> >>>>>>> /* These fields must be setup */
> >>>>>>> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> >>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>>> index 6b3b59cc51d6..9a3c82645472 100644
> >>>>>>> --- a/kernel/sched/fair.c
> >>>>>>> +++ b/kernel/sched/fair.c
> >>>>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>>>>>> rcu_read_unlock();
> >>>>>>> }
> >>>>>>>
> >>>>>>> +/*
> >>>>>>> + * Update cpu idle state and record this information
> >>>>>>> + * in sd_llc_shared->idle_cpus_span.
> >>>>>>> + */
> >>>>>>> +void update_idle_cpumask(struct rq *rq)
> >>>>>>> +{
> >>>>>>> + struct sched_domain *sd;
> >>>>>>> + int cpu = cpu_of(rq);
> >>>>>>> +
> >>>>>>> + rcu_read_lock();
> >>>>>>> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>>>>>> + if (!sd || !sd->shared)
> >>>>>>> + goto unlock;
> >>>>>>> + if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>>>>>> + goto unlock;
>
> Oops, I realized I didn't send an update out to fix this after I fixed
> it locally. It should be
>
> if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
the fix doesn't change the perf results
>
> Sorry for this, Vincent, :(
>
> >>>>>>> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>>>>>> +unlock:
> >>>>>>> + rcu_read_unlock();
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> /*
> >>>>>>> * Scan the entire LLC domain for idle cores; this dynamically switches off if
> >>>>>>> * there are no idle cores left in the system; tracked through
> >>>>>>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>>>>>>
> >>>>>>> time = cpu_clock(this);
> >>>>>>>
> >>>>>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>>>>> + /*
> >>>>>>> + * sched_domain_shared is set only at shared cache level,
> >>>>>>> + * this works only because select_idle_cpu is called with
> >>>>>>> + * sd_llc.
> >>>>>>> + */
> >>>>>>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>>>>>>
> >>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
> >>>>>>> if (!--nr)
> >>>>>>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>>>>>>
> >>>>>>> if (want_affine)
> >>>>>>> current->recent_used_cpu = cpu;
> >>>>>>> +
> >>>>>>> + sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> >>>>>>> + if (sd && sd->shared)
> >>>>>>> + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
> >>>>>>
> >>>>>> Why are you clearing the bit only for the fast path ? the slow path
> >>>>>> can also select an idle CPU
> >>>>
> >>>> Right, I saw that idle core searching is turned off in the fast path only
> >>>> as well; since the next wakeup will check whether the CPU is idle, this
> >>>> only affects the idle cpu search span.
> >>>>
> >>>>>>
> >>>>>> Then, I'm afraid that updating a cpumask at each and every task wakeup
> >>>>>> will be far too expensive. That's why we are ot updating
> >>>>>
> >>>>> That's why we are not updating
> >>>>
> >>>> AFAIK, uperf/netperf are workloads with lots of short idle periods; do you
> >>>> have any other workloads in mind? I can measure them to verify this.
> >>>>>
> >>>>>> nohz.idle_cpus_mask at each and every enter/exit idle but only once
> >>>>>> per tick.
> >>>> Yes, agreed, this needs more thought, especially if the data is really
> >>>> bad.
> >>>>
> >>>>>>
> >>>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>>>>> iterations of: hackbench -l 2560 -g 1
> >>>>>> tip/sched/core : 1.324(+/- 1.26%)
> >>>>>> with this patch : 2.419(+/- 12.31%) -82% regression
> >>>>
> >>>> Can you please clarify this, is this running 2560 loops and 1 group?
> >>>
> >>> yes, it's 2560 loops and 1 group, and I run the bench 12 times:
> >>> $ hackbench -l 2560 -g 1
> >>> Running in process mode with 1 groups using 40 file descriptors each
> >>> (== 40 tasks)
> >>> Each sender will pass 2560 messages of 100 bytes
> >>> Time: 2.953
> >>>
> >>> you can also have a look at perf sched pipe
> >>> tip/sched/core
> >>> $ perf bench sched pipe -T -l 50000
> >>> # Running 'sched/pipe' benchmark:
> >>> # Executed 50000 pipe operations between two threads
> >>>
> >>> Total time: 0.980 [sec]
> >>>
> >>> 19.609160 usecs/op
> >>> 50996 ops/sec
> >>>
> >>> With your patch :
> >>> $ perf bench sched pipe -T -l 50000
> >>> # Running 'sched/pipe' benchmark:
> >>> # Executed 50000 pipe operations between two threads
> >>>
> >>> Total time: 1.283 [sec]
> >>>
> >>> 25.674200 usecs/op
> >>> 38949 ops/sec
> >>>
> >>> which is a 23% regression
> >>
> >> This is interesting, 10 iterations data on my side:
> >>
> >> $ cat upstream.log | grep Total
> >> Total time: 0.799 [sec]
> >> Total time: 0.513 [sec]
> >> Total time: 0.658 [sec]
> >> Total time: 0.499 [sec]
> >> Total time: 0.644 [sec]
> >> Total time: 0.758 [sec]
> >> Total time: 0.576 [sec]
> >> Total time: 0.864 [sec]
> >> Total time: 0.566 [sec]
> >> Total time: 0.694 [sec]
> >> Total time: 0.609 [sec]
> >>
> >> $ cat idle_cpumask.log | grep Total
> >> Total time: 0.595 [sec]
> >> Total time: 0.519 [sec]
> >> Total time: 0.592 [sec]
> >> Total time: 0.504 [sec]
> >> Total time: 0.440 [sec]
> >> Total time: 0.676 [sec]
> >> Total time: 0.683 [sec]
> >> Total time: 0.437 [sec]
> >> Total time: 0.460 [sec]
> >> Total time: 0.473 [sec]
> >> Total time: 0.478 [sec]
> >>
> >> That is, there is a 18.53% improvement.
> >> ========================================
> >> cases avg std
> >> 5.8.10 0.653 17.958
> >> idle_cpumask 0.532 16.916
> >>
> >> Is this caused by the architecture specific atomic operations?
> >> I also noticed I have shorter total time of both hackbench and perf sched
> >> pipe.
> >
> > TBH, I'm surprised that perf sched pipe is shorter with your patch. It
> > adds more code to run but doesn't take advantage of a shorter cpumask in
> > select_idle_cpu(), because it either skips select_idle_cpu() when the
> > previous cpu is idle, which should be the case in this lightly loaded
> > benchmark, or the cpumask should be almost the same as the sched_domain
> > span
> >
> > The sequence is :
> >
> > select_task_rq_fair()
> > select prev which should be idle
> > +cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
> >
> > pick_next_task_fair()
> >
> > set_next_task_fair()
> > +unlikely update_idle_cpumask(rq);
> >
> > wake up other side
> >
> > go back to sleep
> >
> > put_prev_task_fair
> >
> > pick_next_task_idle
> >
> > set_next_task_idle
> > +update_idle_cpumask(rq)
>
> I tend to agree with this, but a 20+% regression doesn't seem reasonable to
> me either, so I ran 100 iterations of sched-pipe to verify; here is the data:
>
> baseline 5.8.10, Score: avg - 0.55, std - 21.73%
> idle_cpumask, Score: avg - 0.52, std - 16.42%
The stdev of your results looks quite high for a perf bench sched
pipe run. I usually get something around 1%
>
> At least I didn't see the regression.
> >
> >>
> >> Would you mind share uperf(netperf load) result on your side? That's the
> >> workload I have seen the most benefit this patch contributed under heavy
> >> load level.
> >
> > with uperf, i've got the same kind of result as sched pipe
> > tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
> > with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
> > regression as for sched pipe
> >
> In case this is caused by the logic error in this patch (sorry again), did
> you see any improvement with patch v2? Though it does not help for the
> nohz=off case, I just want to know whether it helps at all on the arm platform.
With v2, which rate-limits the update of the cpumask (but doesn't
support sched_idle tasks), I don't see any performance impact:
perf bench sched pipe -T -l 50000
# Running 'sched/pipe' benchmark:
# Executed 50000 pipe operations between two threads
Total time: 0.983 [sec]
19.666880 usecs/op
50846 ops/sec
Vincent
>
> Thanks,
> -Aubrey
>
> >>
> >> Maybe I should look at the architecture-specific implementation? Any clue?
> >>
> >>>
> >>>> 10 iterations "./hackbench 1 process 2560" on my side are:
> >>>>
> >>>> 5.8.10: 0.14(+/- 12.01%)
> >>>> =========================
> >>>> [0.089, 0.148, 0.147, 0.141, 0.143, 0.143, 0.143, 0.146, 0.143, 0.142]
> >>>> Score: avg - 0.1385, std - 12.01%
> >>>>
> >>>> With this patch
> >>>> ================
> >>>> [0.095, 0.142, 0.143, 0.142, 0.15, 0.146, 0.144, 0.145, 0.143, 0.145]
> >>>> Score: avg - 0.1395, std - 10.88%
> >>>>
> >>>> I didn't see such big regression.
> >>>>
> >>>>>>
> >>>>>>> }
> >>>>>>> rcu_read_unlock();
> >>>>>>>
> >>>>>>> @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >>>>>>> /* ensure bandwidth has been allocated on our new cfs_rq */
> >>>>>>> account_cfs_rq_runtime(cfs_rq, 0);
> >>>>>>> }
> >>>>>>> + /* Update idle cpumask if task has idle policy */
> >>>>>>> + if (unlikely(task_has_idle_policy(p)))
> >>>>>>> + update_idle_cpumask(rq);
> >>>>>>
> >>>>>> it's wrong because a sched_idle task will run from time to time even
> >>>>>> when some cfs tasks are runnable
> >>>>>>
> >>>> Sorry I didn't get your point. The intention here is to add a SCHED_IDLE cpu to the idle cpumask,
> >>>> so that this cpu can be used as a target for wakeup preemption.
> >>>
> >>> a cpu with sched_idle tasks can be considered idle iff only
> >>> sched_idle tasks are runnable. Look at sched_idle_cpu()
> >>>
> >> You are right, thanks to point this out.
> >>
> >> Thanks,
> >> -Aubrey
> >>>>
> >>>>>>> }
> >>>>>>>
> >>>>>>> void init_cfs_rq(struct cfs_rq *cfs_rq)
> >>>>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>>>>>> index 1ae95b9150d3..876dfdfe35bb 100644
> >>>>>>> --- a/kernel/sched/idle.c
> >>>>>>> +++ b/kernel/sched/idle.c
> >>>>>>> @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >>>>>>> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> >>>>>>> {
> >>>>>>> update_idle_core(rq);
> >>>>>>> + update_idle_cpumask(rq);
> >>>>>>> schedstat_inc(rq->sched_goidle);
> >>>>>>> }
> >>>>>>>
> >>>>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>>>>> index c82857e2e288..7a3355f61bcf 100644
> >>>>>>> --- a/kernel/sched/sched.h
> >>>>>>> +++ b/kernel/sched/sched.h
> >>>>>>> @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> >>>>>>> #else
> >>>>>>> static inline void update_idle_core(struct rq *rq) { }
> >>>>>>> #endif
> >>>>>>> +void update_idle_cpumask(struct rq *rq);
> >>>>>>>
> >>>>>>> DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >>>>>>>
> >>>>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>>>>>> index 9079d865a935..f14a6ef4de57 100644
> >>>>>>> --- a/kernel/sched/topology.c
> >>>>>>> +++ b/kernel/sched/topology.c
> >>>>>>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>>>>>> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>>>>>> atomic_inc(&sd->shared->ref);
> >>>>>>> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >>>>>>> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>>>>>> }
> >>>>>>>
> >>>>>>> sd->private = sdd;
> >>>>>>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>>>>>>
> >>>>>>> *per_cpu_ptr(sdd->sd, j) = sd;
> >>>>>>>
> >>>>>>> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >>>>>>> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>>>>>> GFP_KERNEL, cpu_to_node(j));
> >>>>>>> if (!sds)
> >>>>>>> return -ENOMEM;
> >>>>>>>
> >>>>
> >>
>
On 9/22/20 12:14 AM, Vincent Guittot wrote:
>>
>>>>
>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
Vincent,
Is it octo (=10) or octa (=8) cores on a single socket for your system?
The L2 is per core or there are multiple L2s shared among groups of cores?
Wonder if placing the threads within a L2 or not within
an L2 could cause differences seen with Aubrey's test.
Tim
On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
>
>
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>
> >>
> >>>>
> >>>> And a quick test with hackbench on my octo cores arm64 gives for 12
>
> Vincent,
>
> Is it octo (=10) or octa (=8) cores on a single socket for your system?
In what Romance language does octo mean 10? :)
> The L2 is per core or there are multiple L2s shared among groups of cores?
>
> Wonder if placing the threads within a L2 or not within
> an L2 could cause differences seen with Aubrey's test.
>
> Tim
>
--
On 9/24/20 10:13 AM, Phil Auld wrote:
> On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
>>
>>
>> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>>
>>>>
>>>>>>
>>>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
>>
>> Vincent,
>>
>> Is it octo (=10) or octa (=8) cores on a single socket for your system?
>
> In what Romance language does octo mean 10? :)
>
Got confused by october, the tenth month. :)
Tim
>
>> The L2 is per core or there are multiple L2s shared among groups of cores?
>>
>> Wonder if placing the threads within a L2 or not within
>> an L2 could cause differences seen with Aubrey's test.
>>
>> Tim
>>
>
On Thu, Sep 24, 2020 at 10:43:12AM -0700 Tim Chen wrote:
>
>
> On 9/24/20 10:13 AM, Phil Auld wrote:
> > On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
> >>
> >>
> >> On 9/22/20 12:14 AM, Vincent Guittot wrote:
> >>
> >>>>
> >>>>>>
> >>>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>
> >> Vincent,
> >>
> >> Is it octo (=10) or octa (=8) cores on a single socket for your system?
> >
> > In what Romance language does octo mean 10? :)
> >
>
> Got confused by october, the tenth month. :)
It used to be the eighth month ;)
>
> Tim
>
> >
> >> The L2 is per core or there are multiple L2s shared among groups of cores?
> >>
> >> Wonder if placing the threads within a L2 or not within
> >> an L2 could cause differences seen with Aubrey's test.
> >>
> >> Tim
> >>
> >
>
--
Hi Tim
On Thu, 24 Sep 2020 at 18:37, Tim Chen <[email protected]> wrote:
>
>
>
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
>
> >>
> >>>>
> >>>> And a quick test with hackbench on my octo cores arm64 gives for 12
>
> Vincent,
>
> Is it octo (=10) or octa (=8) cores on a single socket for your system?
it's 8 cores, and the cores are split into 2 cache domains
> The L2 is per core or there are multiple L2s shared among groups of cores?
>
> Wonder if placing the threads within a L2 or not within
> an L2 could cause differences seen with Aubrey's test.
I haven't checked recently but the 2 tasks involved in sched pipe run
on CPUs which belong to the same cache domain
Vincent
>
> Tim
>
Hi Vincent,
On 2020/9/24 21:09, Vincent Guittot wrote:
>>>>
>>>> Would you mind share uperf(netperf load) result on your side? That's the
>>>> workload I have seen the most benefit this patch contributed under heavy
>>>> load level.
>>>
>>> with uperf, i've got the same kind of result as sched pipe
>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>> regression as for sched pipe
>>>
>> In case this is caused by the logic error in this patch (sorry again), did
>> you see any improvement in patch v2? Though it does not help for the nohz=off
>> case, just want to know if it helps or does not help at all on the arm platform.
>
> With v2, which rate-limits the update of the cpumask (but doesn't
> support sched_idle tasks), I don't see any performance impact:
I agree we should go with rate-limiting the cpumask update.
And I think no performance impact for sched-pipe is expected, as this workload
has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
returned, and even if select_idle_sibling is called, select_idle_core is hit
and select_idle_cpu is rarely called.
But I'm more curious why there is a 23% performance penalty. For this patch, if
you revert this change but keep the cpumask updated, is the 23% penalty still there?
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
I just wonder if it's caused by the atomic ops, as you have two cache domains
under sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for me
to find an ARM machine, but I'll try.
Also, for the uperf (task thread num = cpu num) workload, how is it on patch v2? Any
performance impact?
Thanks,
-Aubrey
Le vendredi 25 sept. 2020 à 17:21:46 (+0800), Li, Aubrey a écrit :
> Hi Vincent,
>
> On 2020/9/24 21:09, Vincent Guittot wrote:
> >>>>
> >>>> Would you mind share uperf(netperf load) result on your side? That's the
> >>>> workload I have seen the most benefit this patch contributed under heavy
> >>>> load level.
> >>>
> >>> with uperf, i've got the same kind of result as sched pipe
> >>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
> >>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
> >>> regression as for sched pipe
> >>>
> >> In case this is caused by the logic error in this patch (sorry again), did
> >> you see any improvement in patch v2? Though it does not help for the nohz=off
> >> case, just want to know if it helps or does not help at all on the arm platform.
> >
> > With v2, which rate-limits the update of the cpumask (but doesn't
> > support sched_idle tasks), I don't see any performance impact:
>
> I agree we should go with rate-limiting the cpumask update.
>
> And I think no performance impact for sched-pipe is expected, as this workload
> has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
> returned, and even if select_idle_sibling is called, select_idle_core is hit
> and select_idle_cpu is rarely called.
My platform is not SMT, so select_idle_core is a nop. Nevertheless, select_idle_cpu
is almost never called, because prev is idle and gets selected before we reach it in
our case.
>
> But I'm more curious why there is a 23% performance penalty. For this patch, if
> you revert this change but keep the cpumask updated, is the 23% penalty still there?
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
I was about to say that reverting this line should not change anything, because
we never reach this point, but in fact it does. And after looking at a trace,
I can see that the 2 threads of perf bench sched pipe are on the same CPU and
that sds_idle_cpus(sd->shared) is always empty. In fact, rq->curr is
not yet idle and still points to the cfs task when you call update_idle_cpumask().
This means that once cleared, the bit will never be set again.
You can remove the test in update_idle_cpumask(), which is called either when
entering idle or when only sched_idle tasks are runnable:
@@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
sd = rcu_dereference(per_cpu(sd_llc, cpu));
if (!sd || !sd->shared)
goto unlock;
- if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
- goto unlock;
+
cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
rcu_read_unlock();
With this fix, the performance decrease is only 2%
>
> I just wonder if it's caused by the atomic ops, as you have two cache domains
> under sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for me
> to find an ARM machine, but I'll try.
>
> Also, for the uperf (task thread num = cpu num) workload, how is it on patch v2? Any
> performance impact?
with v2 : Throughput 24.97Mb/s (+/- 0.07%) so there is no perf regression
>
>
> Thanks,
> -Aubrey
On 2020/9/26 0:45, Vincent Guittot wrote:
> Le vendredi 25 sept. 2020 à 17:21:46 (+0800), Li, Aubrey a écrit :
>> Hi Vincent,
>>
>> On 2020/9/24 21:09, Vincent Guittot wrote:
>>>>>>
>>>>>> Would you mind share uperf(netperf load) result on your side? That's the
>>>>>> workload I have seen the most benefit this patch contributed under heavy
>>>>>> load level.
>>>>>
>>>>> with uperf, i've got the same kind of result as sched pipe
>>>>> tip/sched/core: Throughput 24.83Mb/s (+/- 0.09%)
>>>>> with this patch: Throughput 19.02Mb/s (+/- 0.71%) which is a 23%
>>>>> regression as for sched pipe
>>>>>
>>>> In case this is caused by the logic error in this patch (sorry again), did
>>>> you see any improvement in patch v2? Though it does not help for the nohz=off
>>>> case, just want to know if it helps or does not help at all on the arm platform.
>>>
>>> With v2, which rate-limits the update of the cpumask (but doesn't
>>> support sched_idle tasks), I don't see any performance impact:
>>
>> I agree we should go with rate-limiting the cpumask update.
>>
>> And I think no performance impact for sched-pipe is expected, as this workload
>> has only 2 threads and the platform has 8 cores, so mostly the previous cpu is
>> returned, and even if select_idle_sibling is called, select_idle_core is hit
>> and select_idle_cpu is rarely called.
>
> My platform is not SMT, so select_idle_core is a nop. Nevertheless, select_idle_cpu
> is almost never called, because prev is idle and gets selected before we reach it in
> our case.
>
>>
>> But I'm more curious why there is a 23% performance penalty. For this patch, if
>> you revert this change but keep the cpumask updated, is the 23% penalty still there?
>>
>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> I was about to say that reverting this line should not change anything, because
> we never reach this point, but in fact it does. And after looking at a trace,
> I can see that the 2 threads of perf bench sched pipe are on the same CPU and
> that sds_idle_cpus(sd->shared) is always empty. In fact, rq->curr is
> not yet idle and still points to the cfs task when you call update_idle_cpumask().
> This means that once cleared, the bit will never be set again.
> You can remove the test in update_idle_cpumask(), which is called either when
> entering idle or when only sched_idle tasks are runnable:
>
> @@ -6044,8 +6044,7 @@ void update_idle_cpumask(struct rq *rq)
> sd = rcu_dereference(per_cpu(sd_llc, cpu));
> if (!sd || !sd->shared)
> goto unlock;
> - if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> - goto unlock;
> +
> cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
>
> With this fix, the performance decrease is only 2%
>
>>
>> I just wonder if it's caused by the atomic ops, as you have two cache domains
>> under sd_llc(?). Do you have an x86 machine to make a comparison? It's hard for me
>> to find an ARM machine, but I'll try.
>>
>> Also, for the uperf (task thread num = cpu num) workload, how is it on patch v2? Any
>> performance impact?
>
> with v2 : Throughput 24.97Mb/s (+/- 0.07%) so there is no perf regression
>
Thanks Vincent, let me try to refine this patch.
-Aubrey