2015-11-25 19:10:13

by Waiman Long

Subject: [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg

This patch series tries to reduce contention on task_group's load_avg
to improve system performance. It also tries to optimize the use of
the idle_cpu() call in update_sg_lb_stats().

Waiman Long (3):
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
sched/fair: Move hot load_avg into its own cacheline
sched/fair: Use different cachelines for readers and writers of
load_avg

kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 15 ++++++++++++++-
3 files changed, 58 insertions(+), 6 deletions(-)


2015-11-25 19:10:18

by Waiman Long

Subject: [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
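
For reference, idle_cpu() looked roughly like this at the time (a
simplified sketch; minor details vary between kernel versions):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)	/* loads the cacheline holding rq->curr */
		return 0;

	if (rq->nr_running)		/* loads a second, separate cacheline */
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}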

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/fair.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *overload)
{
unsigned long load;
- int i;
+ int i, nr_running;

memset(sgs, 0, sizeof(*sgs));

@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

- if (rq->nr_running > 1)
+ nr_running = rq->nr_running;
+ if (nr_running > 1)
*overload = true;

#ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
sgs->sum_weighted_load += weighted_cpuload(i);
- if (idle_cpu(i))
+ /*
+ * No need to call idle_cpu() if nr_running is not 0
+ */
+ if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
}

--
1.7.1

2015-11-25 19:11:11

by Waiman Long

Subject: [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
was enabled, which is the case for most distribution kernels.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed rather
frequently during clock tick processing.
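
For illustration, a rough sketch of the pre-patch layout (fields
abridged from kernel/sched/sched.h of that era):

struct task_group {
	struct cgroup_subsys_state css;
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity **se;	/* read during tick processing */
	struct cfs_rq **cfs_rq;		/* read during tick processing */
	unsigned long shares;		/* read by update_cfs_shares() */
#ifdef CONFIG_SMP
	atomic_long_t load_avg;		/* written to from every CPU */
#endif
#endif
	/* ... */
};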

This patch moves the load_avg variable into another cacheline separated
from the other frequently accessed variables. By doing so, the perf
profile became:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/sched.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;

#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

--
1.7.1

2015-11-25 19:11:08

by Waiman Long

Subject: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system,
the cacheline contention can cause significant slowdown and impact
performance.

This patch attempts to separate those load_avg readers
(update_cfs_shares) and writers (task_tick_fair) to use different
cachelines instead. Writers of load_avg will now accumulate the
load delta into load_avg_delta, which sits in a different cacheline.
If load_avg_delta is sufficiently large (> load_avg/64), it will then
be added back to load_avg. For example, with load_avg around 6400,
writers keep batching deltas locally until the accumulated total
exceeds 100 before touching the shared load_avg cacheline.

Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

After the patch, it became:

2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares

The benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 916011 Critical-jOps: 142366
After patch - Max-jOPs: 939130 Critical-jOps: 211937

There was a significant improvement in Critical-jOps, which is latency
sensitive.

This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated when running on 4-socket or larger systems,
which can get the most benefit from it.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
kernel/sched/sched.h | 8 ++++++++
3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);

+#ifdef CONFIG_SMP
+ /*
+ * Use load_avg_delta if not 2P or less
+ */
+ root_task_group.use_la_delta = (num_possible_nodes() > 2);
+#endif /* CONFIG_SMP */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ tg->use_la_delta = root_task_group.use_la_delta;
+#endif
return tg;

err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
* and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
*/
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ struct task_group *tg = cfs_rq->tg;
+ long load_avg, tot_delta;
+
+ if (!tg->use_la_delta) {
+ /*
+ * If the use_la_delta isn't set, just add the
+ * delta directly into load_avg.
+ */
+ atomic_long_add(delta, &tg->load_avg);
+ goto set_contrib;
+ }
+
+ tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
+ load_avg = atomic_long_read(&tg->load_avg);
+ if (abs(tot_delta) > load_avg / 64) {
+ tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+ if (tot_delta)
+ atomic_long_add(tot_delta, &tg->load_avg);
+ }
+set_contrib:
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
* load_avg can be heavily contended at clock tick time, so put
* it in its own cacheline separated from the fields above which
* will also be accessed at each tick.
+ *
+ * The use_la_delta flag, if set, will enable the use of load_avg_delta
+ * to accumulate the delta and only change load_avg when the delta
+ * is big enough. This reduces the cacheline contention on load_avg.
+ * This flag will be set at allocation time depending on the system
+ * configuration.
*/
+ int use_la_delta;
atomic_long_t load_avg ____cacheline_aligned;
+ atomic_long_t load_avg_delta ____cacheline_aligned;
#endif
#endif

--
1.7.1

2015-11-30 10:22:49

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg


Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system,
> the cacheline contention can cause significant slowdown and impact
> performance.
>
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulate the
> load delta into load_avg_delta, which sits in a different cacheline.
> If load_avg_delta is sufficiently large (> load_avg/64), it will then
> be added back to load_avg.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> After the patch, it became:
>
> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares

This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
>
> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
> After patch - Max-jOPs: 939130 Critical-jOps: 211937
>
> There was a significant improvement in Critical-jOps, which is latency
> sensitive.
>
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems,
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 8 ++++++++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
> ptr += nr_cpu_ids * sizeof(void **);
>
> +#ifdef CONFIG_SMP
> + /*
> + * Use load_avg_delta if not 2P or less
> + */
> + root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
> #endif /* CONFIG_FAIR_GROUP_SCHED */
> #ifdef CONFIG_RT_GROUP_SCHED
> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
> return tg;
>
> err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
> * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
> */
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + struct task_group *tg = cfs_rq->tg;
> + long load_avg, tot_delta;
> +
> + if (!tg->use_la_delta) {
> + /*
> + * If the use_la_delta isn't set, just add the
> + * delta directly into load_avg.
> + */
> + atomic_long_add(delta, &tg->load_avg);
> + goto set_contrib;
> + }
> +
> + tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> + load_avg = atomic_long_read(&tg->load_avg);
> + if (abs(tot_delta) > load_avg / 64) {
> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> + if (tot_delta)
> + atomic_long_add(tot_delta, &tg->load_avg);
> + }
> +set_contrib:
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> }
> }

I'm thinking that it's now far too big to retain the inline qualifier.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
> * load_avg can be heavily contended at clock tick time, so put
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> + *
> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
> + * to accumulate the delta and only change load_avg when the delta
> + * is big enough. This reduces the cacheline contention on load_avg.
> + * This flag will be set at allocation time depending on the system
> + * configuration.
> */
> + int use_la_delta;
> atomic_long_t load_avg ____cacheline_aligned;
> + atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.

> #endif
> #endif
>
> --
> 1.7.1
>

2015-11-30 10:23:29

by Peter Zijlstra

Subject: Re: [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline

On Wed, Nov 25, 2015 at 02:09:39PM -0500, Waiman Long wrote:
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
> unsigned long shares;
>
> #ifdef CONFIG_SMP
> - atomic_long_t load_avg;
> + /*
> + * load_avg can be heavily contended at clock tick time, so put
> + * it in its own cacheline separated from the fields above which
> + * will also be accessed at each tick.
> + */
> + atomic_long_t load_avg ____cacheline_aligned;

Same as with the other patch; this only works if the structure itself is
cacheline aligned, which I don't think it is.

> #endif
> #endif
>
> --
> 1.7.1
>

2015-11-30 19:13:37

by Waiman Long

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the
>> load delta into load_avg_delta, which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (> load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
>> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
>> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.

>> The benchmark results before and after the patch were as follows:
>>
>> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
>> After patch - Max-jOPs: 939130 Critical-jOps: 211937
>>
>> There was a significant improvement in Critical-jOps, which is latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems,
>> which can get the most benefit from it.
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.

>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> kernel/sched/core.c | 9 +++++++++
>> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 8 ++++++++
>> 3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>> ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> + /*
>> + * Use load_avg_delta if not 2P or less
>> + */
>> + root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>> #ifdef CONFIG_RT_GROUP_SCHED
>> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>> if (!alloc_rt_sched_group(tg, parent))
>> goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> + tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>> return tg;
>>
>> err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>> * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>> */
>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> {
>> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> + struct task_group *tg = cfs_rq->tg;
>> + long load_avg, tot_delta;
>> +
>> + if (!tg->use_la_delta) {
>> + /*
>> + * If the use_la_delta isn't set, just add the
>> + * delta directly into load_avg.
>> + */
>> + atomic_long_add(delta, &tg->load_avg);
>> + goto set_contrib;
>> + }
>> +
>> + tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> + load_avg = atomic_long_read(&tg->load_avg);
>> + if (abs(tot_delta) > load_avg / 64) {
>> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> + if (tot_delta)
>> + atomic_long_add(tot_delta, &tg->load_avg);
>> + }
>> +set_contrib:
>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> }
>> }
> I'm thinking that it's now far too big to retain the inline qualifier.

I can take the inline keyword out.

>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>> * load_avg can be heavily contended at clock tick time, so put
>> * it in its own cacheline separated from the fields above which
>> * will also be accessed at each tick.
>> + *
>> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> + * to accumulate the delta and only change load_avg when the delta
>> + * is big enough. This reduces the cacheline contention on load_avg.
>> + * This flag will be set at allocation time depending on the system
>> + * configuration.
>> */
>> + int use_la_delta;
>> atomic_long_t load_avg ____cacheline_aligned;
>> + atomic_long_t load_avg_delta ____cacheline_aligned;
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increased the size to 512 bytes. So it did make the structure a multiple
of the cacheline size. With both slub and slab, the allocated task group
pointers from kzalloc() in sched_create_group() were all multiples of
0x200. So they were properly aligned for the ____cacheline_aligned tag
to work.

> Also, you unconditionally grow the structure by a whole cacheline.

I know it is a drawback of using the ____cacheline_aligned tag. However, we
probably won't create too many task groups in normal use. So the
increase in memory consumption shouldn't be noticeable.

Cheers,
Longman

2015-11-30 22:10:01

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This begs the question though; why are you running a global load in a
> >cgroup; and do we really need to update this for the root cgroup? It
> >seems to me we don't need calc_tg_weight() for the root cgroup, it
> >doesn't need to normalize its weight numbers.
> >
> >That is; isn't this simply a problem we should avoid?
>
> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> times.

Yeah, can you kill autogroup and see if that helps? If not, we probably
should add some code to avoid calculating things for the root group.
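
(For reference: autogroup can be disabled at boot with the noautogroup
kernel parameter, or at run time by writing 0 to
/proc/sys/kernel/sched_autogroup_enabled.)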

2015-11-30 22:29:45

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This would only work if the structure itself is allocated with cacheline
> >alignment, and looking at sched_create_group(), we use a plain kzalloc()
> >for this, which doesn't guarantee any sort of alignment beyond machine
> >word size IIRC.
>
> With a RHEL 6 derived .config file, the size of the task_group structure was
> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
> increased the size to 512 bytes. So it did make the structure a multiple of
> the cacheline size. With both slub and slab, the allocated task group
> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
> So they were properly aligned for the ____cacheline_aligned tag to work.

Not sure we should rely on sl*b doing the right thing here.
KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
explicit alignment, one should use KMEM_CACHE().
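
As an illustration, a minimal sketch of that suggestion (hypothetical
placement and names; not an actual patch from this thread):

/* A dedicated slab cache guarantees the requested object alignment. */
static struct kmem_cache *task_group_cache __read_mostly;

void __init sched_init(void)
{
	/*
	 * KMEM_CACHE() passes __alignof__(struct task_group) as the
	 * alignment, which the ____cacheline_aligned member raises to
	 * a full cacheline.
	 */
	task_group_cache = KMEM_CACHE(task_group, 0);
	/* ... */
}

struct task_group *sched_create_group(struct task_group *parent)
{
	struct task_group *tg;

	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
	if (!tg)
		return ERR_PTR(-ENOMEM);
	/* ... the rest of the group setup is unchanged ... */
	return tg;
}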

2015-11-30 22:32:39

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> >Please always Cc the people who wrote the code.
> >
> >+CC pjt, ben, morten, yuyang
>
> Sorry for that. Their names didn't show up when I did get_maintainer.pl.

Ah, I never much use get_maintainers.pl, but it might be that --git-blame
would yield some of the people who poked at this code. But I suspect it's
all yuyang now, seeing how he recently reworked it all.

2015-12-01 03:55:08

by Waiman Long

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This begs the question though; why are you running a global load in a
>>> cgroup; and do we really need to update this for the root cgroup? It
>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>> doesn't need to normalize its weight numbers.
>>>
>>> That is; isn't this simply a problem we should avoid?
>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>> times.
> Yeah, can you kill autogroup and see if that helps? If not, we probably
> should add some code to avoid calculating things for the root group.

I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
the distribution kernels. So we still need to look at that with
autogroup enabled.

Cheers,
Longman

2015-12-01 04:00:43

by Waiman Long

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On 11/30/2015 05:29 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This would only work if the structure itself is allocated with cacheline
>>> alignment, and looking at sched_create_group(), we use a plain kzalloc()
>>> for this, which doesn't guarantee any sort of alignment beyond machine
>>> word size IIRC.
>> With a RHEL 6 derived .config file, the size of the task_group structure was
>> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
>> increased the size to 512 bytes. So it did make the structure a multiple of
>> the cacheline size. With both slub and slab, the allocated task group
>> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
>> So they were properly aligned for the ____cacheline_aligned tag to work.
> Not sure we should rely on sl*b doing the right thing here.
> KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
> explicit alignment, one should use KMEM_CACHE().

I think the current kernel uses power-of-2 kmemcaches to satisfy kmalloc()
requests except when the size is less than or equal to 192 where there
are some non-power-of-2 kmemcaches available. Given that the task_group
structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't
hit the case that the allocated buffer is not cacheline aligned.

Cheers,
Longman

2015-12-01 08:47:58

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:

> I think the current kernel uses power-of-2 kmemcaches to satisfy kmalloc()
> requests except when the size is less than or equal to 192 where there are
> some non-power-of-2 kmemcaches available. Given that the task_group
> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
> the case that the allocated buffer is not cacheline aligned.

Using out-of-object storage is allowed (none of the existing sl*b
allocators do so iirc).

That is, it's perfectly valid for a sl*b allocator for 64 byte objects to
allocate 72 bytes for each object and use the 'spare' 8 bytes for object
tracking or whatnot.

That would respect the minimum alignment guarantee of 8 bytes but not
provide the 'expected' object size alignment you're assuming.

Also, we have the proper interfaces to request the explicit alignment
for a reason. So if you need the alignment for correctness, use those.

2015-12-01 08:49:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >>>This begs the question though; why are you running a global load in a
> >>>cgroup; and do we really need to update this for the root cgroup? It
> >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> >>>doesn't need to normalize its weight numbers.
> >>>
> >>>That is; isn't this simply a problem we should avoid?
> >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> >>times.
> >Yeah, can you kill autogroup and see if that helps? If not, we probably
> >should add some code to avoid calculating things for the root group.
>
> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> distribution kernels. So we still need to look at that with autogroup
> enabled.

Meh, or just tell the people that have stupid large machines to use
noautogroup on boot (it's of questionable benefit in the first place imo,
esp. on servers).

2015-12-01 10:44:27

by Mike Galbraith

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On Tue, 2015-12-01 at 09:49 +0100, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> > On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> > >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> > >>>This begs the question though; why are you running a global load in a
> > >>>cgroup; and do we really need to update this for the root cgroup? It
> > >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> > >>>doesn't need to normalize its weight numbers.
> > >>>
> > >>>That is; isn't this simply a problem we should avoid?
> > >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> > >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> > >>times.
> > >Yeah, can you kill autogroup and see if that helps? If not, we probably
> > >should add some code to avoid calculating things for the root group.
> >
> > I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> > distribution kernels. So we still need to look at that with autogroup
> > enabled.
>
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place imo,
> esp. on servers).

Yup (and yup). Someone should also suggest to the systemd(isease) folks
that they try actually measuring before turning everything in the world
on. "Oo cgroups are cool" is NOT a good reason to turn it all on :)

-Mike

2015-12-02 18:44:54

by Waiman Long

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On 12/01/2015 03:47 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
>
>> I think the current kernel uses power-of-2 kmemcaches to satisfy kmalloc()
>> requests except when the size is less than or equal to 192 where there are
>> some non-power-of-2 kmemcaches available. Given that the task_group
>> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
>> the case that the allocated buffer is not cacheline aligned.
> Using out-of-object storage is allowed (none of the existing sl*b
> allocators do so iirc).
>
> That is, it's perfectly valid for a sl*b allocator for 64 byte objects to
> allocate 72 bytes for each object and use the 'spare' 8 bytes for object
> tracking or whatnot.
>
> That would respect the minimum alignment guarantee of 8 bytes but not
> provide the 'expected' object size alignment you're assuming.
>
> Also, we have the proper interfaces to request the explicit alignment
> for a reason. So if you need the alignment for correctness, use those.

Thanks for the tip. I have just sent out an updated patch set which
creates a cache-aligned kmem cache for the task group structure. That
should work under all kernel config settings.

Cheers,
Longman

2015-12-02 18:48:35

by Waiman Long

Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

On 12/01/2015 03:49 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
>> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
>>> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>>>> This begs the question though; why are you running a global load in a
>>>>> cgroup; and do we really need to update this for the root cgroup? It
>>>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>>>> doesn't need to normalize its weight numbers.
>>>>>
>>>>> That is; isn't this simply a problem we should avoid?
>>>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>>>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>>>> times.
>>> Yeah, can you kill autogroup and see if that helps? If not, we probably
>>> should add some code to avoid calculating things for the root group.
>> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
>> distribution kernels. So we still need to look at that with autogroup
>> enabled.
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place imo,
> esp. on servers).

Yes, I was able to recover most of the lost performance by disabling
autogroup. I did send out a new patch to disable the load_avg update for
the root_task_group. I need that for backporting to earlier kernels,
which were forced to update load_avg on every clock tick even for the
root_task_group.
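
A minimal sketch of that idea, using the update_tg_load_avg() from this
series as the base (the separately posted patch may differ):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/*
	 * The root_task_group does not need to normalize its weight
	 * against tg->load_avg, so skip the contended atomic update.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}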

Cheers,
Longman

Subject: [tip:sched/core] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Commit-ID: a426f99c91d1036767a7819aaaba6bd3191b7f06
Gitweb: http://git.kernel.org/tip/a426f99c91d1036767a7819aaaba6bd3191b7f06
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 25 Nov 2015 14:09:38 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 4 Dec 2015 10:34:47 +0100

sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efd664c..4b0e8b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6398,7 +6398,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *overload)
{
unsigned long load;
- int i;
+ int i, nr_running;

memset(sgs, 0, sizeof(*sgs));

@@ -6415,7 +6415,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

- if (rq->nr_running > 1)
+ nr_running = rq->nr_running;
+ if (nr_running > 1)
*overload = true;

#ifdef CONFIG_NUMA_BALANCING
@@ -6423,7 +6424,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
sgs->sum_weighted_load += weighted_cpuload(i);
- if (idle_cpu(i))
+ /*
+ * No need to call idle_cpu() if nr_running is not 0
+ */
+ if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
}