This patch series tries to reduce contention on task_group's load_avg
to improve system performance. It also tries to optimize the use of
the idle_cpu() call in update_sg_lb_stats().
Waiman Long (3):
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
sched/fair: Move hot load_avg into its own cacheline
sched/fair: Use different cachelines for readers and writers of
load_avg
kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 15 ++++++++++++++-
3 files changed, 58 insertions(+), 6 deletions(-)
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.
With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
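For reference, idle_cpu() of this era looks roughly like the sketch
below (abridged, not part of this patch); the rq->curr test is what
pulls in the extra cacheline:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)	/* loads the rq->curr cacheline */
		return 0;

	if (rq->nr_running)		/* already read by the caller */
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}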
This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/fair.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *overload)
{
unsigned long load;
- int i;
+ int i, nr_running;
memset(sgs, 0, sizeof(*sgs));
@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
- if (rq->nr_running > 1)
+ nr_running = rq->nr_running;
+ if (nr_running > 1)
*overload = true;
#ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
sgs->sum_weighted_load += weighted_cpuload(i);
- if (idle_cpu(i))
+ /*
+ * No need to call idle_cpu() if nr_running is not 0
+ */
+ if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
}
--
1.7.1
When a system with a large number of sockets was driven to full
utilization, it was found that clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
was enabled, as it is in most distribution kernels.
Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:
10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contains the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.
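The contended reader is the per-tick share recalculation. A simplified
sketch of that path, assuming a calc_tg_weight()-style helper as in the
4.4-era code (not a verbatim copy of fair.c), is:

static long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
{
	long tg_weight;

	/* Global sum of per-cfs_rq contributions (the contended read). */
	tg_weight = atomic_long_read(&tg->load_avg);

	/* Swap this CPU's possibly stale contribution for its current load. */
	tg_weight -= cfs_rq->tg_load_avg_contrib;
	tg_weight += cfs_rq_load_avg(cfs_rq);

	return tg_weight;
}

Every CPU runs this from its tick path for each runnable group entity,
so the cacheline holding load_avg (and, before this patch, shares,
cfs_rq and se) bounces across all 16 sockets.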
This patch moves the load_avg variable into another cacheline separated
from the other frequently accessed variables. By doing so, the perf
profile became:
9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
The %CPU time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:
Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366
Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/sched.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;
#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif
--
1.7.1
The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system,
the cacheline contention can cause significant slowdown and impact
performance.
This patch attempts to separate the load_avg readers
(update_cfs_shares) and writers (task_tick_fair) so that they use
different cachelines. Writers of load_avg will now accumulate the
load delta into load_avg_delta, which sits in a different cacheline.
If load_avg_delta is sufficiently large (> load_avg/64), it will then
be added back to load_avg.
Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:
9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
After the patch, it became:
2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
The benchmark results before and after the patch were as follows:
Before patch - Max-jOPs: 916011 Critical-jOps: 142366
After patch  - Max-jOPs: 939130 Critical-jOps: 211937
There was a significant improvement in Critical-jOps, which is latency
sensitive.
This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated when running on 4-socket or larger systems,
which can get the most benefit from it.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
kernel/sched/sched.h | 8 ++++++++
3 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
+#ifdef CONFIG_SMP
+ /*
+ * Use load_avg_delta if not 2P or less
+ */
+ root_task_group.use_la_delta = (num_possible_nodes() > 2);
+#endif /* CONFIG_SMP */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ tg->use_la_delta = root_task_group.use_la_delta;
+#endif
return tg;
err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
* and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
*/
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ struct task_group *tg = cfs_rq->tg;
+ long load_avg, tot_delta;
+
+ if (!tg->use_la_delta) {
+ /*
+ * If the use_la_delta isn't set, just add the
+ * delta directly into load_avg.
+ */
+ atomic_long_add(delta, &tg->load_avg);
+ goto set_contrib;
+ }
+
+ tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
+ load_avg = atomic_long_read(&tg->load_avg);
+ if (abs(tot_delta) > load_avg / 64) {
+ tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+ if (tot_delta)
+ atomic_long_add(tot_delta, &tg->load_avg);
+ }
+set_contrib:
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
* load_avg can be heavily contended at clock tick time, so put
* it in its own cacheline separated from the fields above which
* will also be accessed at each tick.
+ *
+ * The use_la_delta flag, if set, will enable the use of load_avg_delta
+ * to accumulate the delta and only change load_avg when the delta
+ * is big enough. This reduces the cacheline contention on load_avg.
+ * This flag will be set at allocation time depending on the system
+ * configuration.
*/
+ int use_la_delta;
atomic_long_t load_avg ____cacheline_aligned;
+ atomic_long_t load_avg_delta ____cacheline_aligned;
#endif
#endif
--
1.7.1
Please always Cc the people who wrote the code.
+CC pjt, ben, morten, yuyang
On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system,
> the cacheline contention can cause significant slowdown and impact
> performance.
>
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulates the
> load delta into load_avg_delta which sits in a different cacheline.
> If load_avg_delta is sufficiently large (> load_avg/64), it will then
> be added back to load_avg.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> After the patch, it became:
>
> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.
That is; isn't this simply a problem we should avoid?
> The benchmark results before and after the patch were as follows:
>
> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
> AFter patch - Max-jOPs: 939130 Critical-jOps: 211937
>
> There was significant improvement in Critical-jOps which was latency
> sensitive.
>
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on a 4-socket or larger systems
> which can get the most benefit from it.
So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 8 ++++++++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
> ptr += nr_cpu_ids * sizeof(void **);
>
> +#ifdef CONFIG_SMP
> + /*
> + * Use load_avg_delta if not 2P or less
> + */
> + root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
> #endif /* CONFIG_FAIR_GROUP_SCHED */
> #ifdef CONFIG_RT_GROUP_SCHED
> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
> return tg;
>
> err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
> * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
> */
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + struct task_group *tg = cfs_rq->tg;
> + long load_avg, tot_delta;
> +
> + if (!tg->use_la_delta) {
> + /*
> + * If the use_la_delta isn't set, just add the
> + * delta directly into load_avg.
> + */
> + atomic_long_add(delta, &tg->load_avg);
> + goto set_contrib;
> + }
> +
> + tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> + load_avg = atomic_long_read(&tg->load_avg);
> + if (abs(tot_delta) > load_avg / 64) {
> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> + if (tot_delta)
> + atomic_long_add(tot_delta, &tg->load_avg);
> + }
> +set_contrib:
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> }
> }
I'm thinking that its now far too big to retain the inline qualifier.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
> * load_avg can be heavily contended at clock tick time, so put
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> + *
> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
> + * to accumulate the delta and only change load_avg when the delta
> + * is big enough. This reduces the cacheline contention on load_avg.
> + * This flag will be set at allocation time depending on the system
> + * configuration.
> */
> + int use_la_delta;
> atomic_long_t load_avg ____cacheline_aligned;
> + atomic_long_t load_avg_delta ____cacheline_aligned;
This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.
Also, you unconditionally grow the structure by a whole cacheline.
> #endif
> #endif
>
> --
> 1.7.1
>
On Wed, Nov 25, 2015 at 02:09:39PM -0500, Waiman Long wrote:
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
> unsigned long shares;
>
> #ifdef CONFIG_SMP
> - atomic_long_t load_avg;
> + /*
> + * load_avg can be heavily contended at clock tick time, so put
> + * it in its own cacheline separated from the fields above which
> + * will also be accessed at each tick.
> + */
> + atomic_long_t load_avg ____cacheline_aligned;
Same as with the other patch; this only works if the structure itself is
cacheline aligned, which I don't think it is.
> #endif
> #endif
>
> --
> 1.7.1
>
On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang
Sorry for that. Their names didn't show up when I did get_maintainer.pl.
> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulates the
>> load delta into load_avg_delta which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (> load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
>> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
>> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
> This begs the question tough; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?
I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.
>> The benchmark results before and after the patch were as follows:
>>
>> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
>> AFter patch - Max-jOPs: 939130 Critical-jOps: 211937
>>
>> There was significant improvement in Critical-jOps which was latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on a 4-socket or larger systems
>> which can get the most benefit from it.
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.
I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> kernel/sched/core.c | 9 +++++++++
>> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 8 ++++++++
>> 3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>> ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> + /*
>> + * Use load_avg_delta if not 2P or less
>> + */
>> + root_task_group.use_la_delta = (num_possible_nodes()> 2);
>> +#endif /* CONFIG_SMP */
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>> #ifdef CONFIG_RT_GROUP_SCHED
>> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>> if (!alloc_rt_sched_group(tg, parent))
>> goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED)&& defined(CONFIG_SMP)
>> + tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>> return tg;
>>
>> err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>> * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>> */
>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> {
>> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>> if (force || abs(delta)> cfs_rq->tg_load_avg_contrib / 64) {
>> - atomic_long_add(delta,&cfs_rq->tg->load_avg);
>> + struct task_group *tg = cfs_rq->tg;
>> + long load_avg, tot_delta;
>> +
>> + if (!tg->use_la_delta) {
>> + /*
>> + * If the use_la_delta isn't set, just add the
>> + * delta directly into load_avg.
>> + */
>> + atomic_long_add(delta,&tg->load_avg);
>> + goto set_contrib;
>> + }
>> +
>> + tot_delta = atomic_long_add_return(delta,&tg->load_avg_delta);
>> + load_avg = atomic_long_read(&tg->load_avg);
>> + if (abs(tot_delta)> load_avg / 64) {
>> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> + if (tot_delta)
>> + atomic_long_add(tot_delta,&tg->load_avg);
>> + }
>> +set_contrib:
>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> }
>> }
> I'm thinking that its now far too big to retain the inline qualifier.
I can take the inline keyword out.
>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>> * load_avg can be heavily contended at clock tick time, so put
>> * it in its own cacheline separated from the fields above which
>> * will also be accessed at each tick.
>> + *
>> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> + * to accumulate the delta and only change load_avg when the delta
>> + * is big enough. This reduces the cacheline contention on load_avg.
>> + * This flag will be set at allocation time depending on the system
>> + * configuration.
>> */
>> + int use_la_delta;
>> atomic_long_t load_avg ____cacheline_aligned;
>> + atomic_long_t load_avg_delta ____cacheline_aligned;
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.
With a RHEL 6 derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increased the size to 512 bytes. So it did make the structure a multiple
of the cacheline size. With both slub and slab, the allocated task group
pointers from kzalloc() in sched_create_group() were all multiples of
0x200. So they were properly aligned for the ____cacheline_aligned tag
to work.
> Also, you unconditionally grow the structure by a whole cacheline.
I know it is a drawback of using the ____cacheline_aligned tag. However, we
probably won't create too many task groups in normal use. So the
increase in memory consumption shouldn't be noticeable.
Cheers,
Longman
On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This begs the question tough; why are you running a global load in a
> >cgroup; and do we really need to update this for the root cgroup? It
> >seems to me we don't need calc_tg_weight() for the root cgroup, it
> >doesn't need to normalize its weight numbers.
> >
> >That is; isn't this simply a problem we should avoid?
>
> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> times.
Yeah, can you kill autogroup and see if that helps? If not, we probably
should add some code to avoid calculating things for the root group.
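A minimal sketch of that idea (short-circuiting the global accounting
for the root group, whose weight never needs normalizing) might look
like this; it is an illustration only, not a patch from this thread:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/*
	 * The root group's shares are never recomputed, so nothing needs
	 * root_task_group.load_avg; skip the contended atomic entirely.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}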
On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This would only work if the structure itself is allocated with cacheline
> >alignment, and looking at sched_create_group(), we use a plain kzalloc()
> >for this, which doesn't guarantee any sort of alignment beyond machine
> >word size IIRC.
>
> With a RHEL 6 derived .config file, the size of the task_group structure was
> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
> increase the size to 512 bytes. So it did make the structure a multiple of
> the cacheline size. With both slub and slab, the allocated task group
> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
> So they were properly aligned for the ____cacheline_aligned tag to work.
Not sure we should rely on sl*b doing the right thing here.
KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
explicit alignment, one should use KMEM_CACHE().
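For illustration, that might look roughly like the sketch below; the
cache name and init hook are hypothetical, not from an actual patch:

/*
 * Give task_group its own slab cache so objects honour the structure's
 * alignment; KMEM_CACHE() (from <linux/slab.h>) passes
 * __alignof__(struct task_group), which is a full cacheline once the
 * struct contains ____cacheline_aligned members.
 */
static struct kmem_cache *task_group_cache __read_mostly;

void __init task_group_cache_init(void)		/* hypothetical init hook */
{
	task_group_cache = KMEM_CACHE(task_group, 0);
}

static struct task_group *task_group_alloc(void)	/* replaces the plain kzalloc() */
{
	return kmem_cache_zalloc(task_group_cache, GFP_KERNEL);
}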
On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> >Please always Cc the people who wrote the code.
> >
> >+CC pjt, ben, morten, yuyang
>
> Sorry for that. Their names didn't show up when I did get_maintainer.pl.
Ah, I never much use get_maintainer.pl, but it might be that --git-blame
would yield some of the people who poked at this code. But I suspect it's
all yuyang now, seeing how he recently reworked it all.
On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This begs the question tough; why are you running a global load in a
>>> cgroup; and do we really need to update this for the root cgroup? It
>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>> doesn't need to normalize its weight numbers.
>>>
>>> That is; isn't this simply a problem we should avoid?
>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>> times.
> Yeah, can you kill autogroup and see if that helps? If not, we probably
> should add some code to avoid calculating things for the root group.
I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
the distribution kernels. So we still need to look at that with
autogroup enabled.
Cheers,
Longman
On 11/30/2015 05:29 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This would only work if the structure itself is allocated with cacheline
>>> alignment, and looking at sched_create_group(), we use a plain kzalloc()
>>> for this, which doesn't guarantee any sort of alignment beyond machine
>>> word size IIRC.
>> With a RHEL 6 derived .config file, the size of the task_group structure was
>> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
>> increase the size to 512 bytes. So it did make the structure a multiple of
>> the cacheline size. With both slub and slab, the allocated task group
>> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
>> So they were properly aligned for the ____cacheline_aligned tag to work.
> Not sure we should rely on sl*b doing the right thing here.
> KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
> explicit alignment, one should use KMEM_CACHE().
I think the current kernel uses power-of-2 kmem caches to satisfy
kmalloc() requests, except when the size is less than or equal to 192,
where some non-power-of-2 kmem caches are available. Given that the task_group
structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't
hit the case that the allocated buffer is not cacheline aligned.
Cheers,
Longman
On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
> I think the current kernel use power-of-2 kmemcaches to satisfy kalloc()
> requests except when the size is less than or equal to 192 where there are
> some non-power-of-2 kmemcaches available. Given that the task_group
> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
> the case that the allocated buffer is not cacheline aligned.
Using out-of-object storage is allowed (none of the existing sl*b
allocators do so iirc).
That is, it's perfectly valid for a sl*b allocator for 64 byte objects to
allocate 72 bytes for each object and use the 'spare' 8 bytes for object
tracking or whatnot.
That would respect the minimum alignment guarantee of 8 bytes but not
provide the 'expected' object size alignment you're assuming.
Also, we have the proper interfaces to request the explicit alignment
for a reason. So if you need the alignment for correctness, use those.
On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >>>This begs the question tough; why are you running a global load in a
> >>>cgroup; and do we really need to update this for the root cgroup? It
> >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> >>>doesn't need to normalize its weight numbers.
> >>>
> >>>That is; isn't this simply a problem we should avoid?
> >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> >>times.
> >Yeah, can you kill autogroup and see if that helps? If not, we probably
> >should add some code to avoid calculating things for the root group.
>
> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> distribution kernels. So we still need to look at that with autogroup
> enabled.
Meh, or just tell the people that have stupid large machines to use
noautogroup on boot (it's of questionable benefit in the first place imo,
esp. on servers).
On Tue, 2015-12-01 at 09:49 +0100, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> > On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> > >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> > >>>This begs the question tough; why are you running a global load in a
> > >>>cgroup; and do we really need to update this for the root cgroup? It
> > >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> > >>>doesn't need to normalize its weight numbers.
> > >>>
> > >>>That is; isn't this simply a problem we should avoid?
> > >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> > >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> > >>times.
> > >Yeah, can you kill autogroup and see if that helps? If not, we probably
> > >should add some code to avoid calculating things for the root group.
> >
> > I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> > distribution kernels. So we still need to look at that with autogroup
> > enabled.
>
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (its of questionable benefit in the first place imo,
> esp. on servers).
Yup (and yup). Someone should also suggest to the systemd(isease) folks
that they try actually measuring before turning everything in the world
on. "Oo cgroups are cool" is NOT a good reason to turn it all on :)
-Mike
On 12/01/2015 03:47 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
>
>> I think the current kernel use power-of-2 kmemcaches to satisfy kalloc()
>> requests except when the size is less than or equal to 192 where there are
>> some non-power-of-2 kmemcaches available. Given that the task_group
>> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
>> the case that the allocated buffer is not cacheline aligned.
> Using out-of-object storage is allowed (none of the existing sl*b
> allocators do so iirc).
>
> That is, its perfectly valid for a sl*b allocator for 64 byte objects to
> allocate 72 bytes for each object and use the 'spare' 8 bytes for object
> tracking or whatnot.
>
> That would respect the minimum alignment guarantee of 8 bytes but not
> provide the 'expected' object size alignment you're assuming.
>
> Also, we have the proper interfaces to request the explicit alignment
> for a reason. So if you need the alignment for correctness, use those.
Thanks for the tip. I have just sent out an updated patch set which
creates a cache-aligned kmem cache for the task group structure. That
should work under all kernel config settings.
Cheers,
Longman
On 12/01/2015 03:49 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
>> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
>>> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>>>> This begs the question tough; why are you running a global load in a
>>>>> cgroup; and do we really need to update this for the root cgroup? It
>>>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>>>> doesn't need to normalize its weight numbers.
>>>>>
>>>>> That is; isn't this simply a problem we should avoid?
>>>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>>>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>>>> times.
>>> Yeah, can you kill autogroup and see if that helps? If not, we probably
>>> should add some code to avoid calculating things for the root group.
>> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
>> distribution kernels. So we still need to look at that with autogroup
>> enabled.
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (its of questionable benefit in the first place imo,
> esp. on servers).
Yes, I was able to recover most of the lost performance by disabling
autogroup. I did send out a new patch to disable the load_avg update for
the root_task_group. I need that for backporting to earlier kernels, which
were forced to update load_avg on every clock tick even for the root_task_group.
Cheers,
Longman
Commit-ID: a426f99c91d1036767a7819aaaba6bd3191b7f06
Gitweb: http://git.kernel.org/tip/a426f99c91d1036767a7819aaaba6bd3191b7f06
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 25 Nov 2015 14:09:38 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 4 Dec 2015 10:34:47 +0100
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.
With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efd664c..4b0e8b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6398,7 +6398,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *overload)
{
unsigned long load;
- int i;
+ int i, nr_running;
memset(sgs, 0, sizeof(*sgs));
@@ -6415,7 +6415,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
- if (rq->nr_running > 1)
+ nr_running = rq->nr_running;
+ if (nr_running > 1)
*overload = true;
#ifdef CONFIG_NUMA_BALANCING
@@ -6423,7 +6424,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
sgs->sum_weighted_load += weighted_cpuload(i);
- if (idle_cpu(i))
+ /*
+ * No need to call idle_cpu() if nr_running is not 0
+ */
+ if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
}