2015-12-02 18:42:08

by Waiman Long

Subject: [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg

v1->v2:
- Make a kmem_cache for task_group to make sure that the allocated
task_group objects will always be on a cacheline boundary even if
debugging is turned on.
- Scrap the original patch 3 and replace it with another one to
disable load_avg update for root_task_group.

This patch series tries to reduce contention on task_group's load_avg
to improve system performance. It also tries to optimize the use of the
idle_cpu() call in update_sg_lb_stats().

Waiman Long (3):
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
sched/fair: Move hot load_avg into its own cacheline
sched/fair: Disable tg load_avg update for root_task_group

kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
kernel/sched/fair.c | 16 +++++++++++++---
kernel/sched/sched.h | 7 ++++++-
3 files changed, 53 insertions(+), 6 deletions(-)


2015-12-02 18:42:09

by Waiman Long

Subject: [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first, followed by nr_running. As nr_running has already been
accessed by update_sg_lb_stats() earlier, it makes no sense to load
another cacheline when nr_running is not 0, as idle_cpu() will always
return false in this case.
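
For reference, idle_cpu() in kernels of this vintage looks roughly like
the following (paraphrased from kernel/sched/core.c; exact details may
vary by version):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)	/* touches rq->curr's cacheline */
		return 0;

	if (rq->nr_running)		/* a different cacheline */
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}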

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/fair.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *overload)
{
unsigned long load;
- int i;
+ int i, nr_running;

memset(sgs, 0, sizeof(*sgs));

@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

- if (rq->nr_running > 1)
+ nr_running = rq->nr_running;
+ if (nr_running > 1)
*overload = true;

#ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
sgs->sum_weighted_load += weighted_cpuload(i);
- if (idle_cpu(i))
+ /*
+ * No need to call idle_cpu() if nr_running is not 0
+ */
+ if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
}

--
1.7.1

2015-12-02 18:42:49

by Waiman Long

Subject: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se, which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all the allocated task_groups are cacheline aligned.
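
As a rough userspace illustration of the layout effect (assuming 64-byte
cachelines; tg_like and its fields are stand-ins, not the real struct
task_group):

#include <stdio.h>
#include <stddef.h>

#define ____cacheline_aligned __attribute__((__aligned__(64)))

struct tg_like {
	void *cfs_rq;				/* read-mostly, read each tick */
	void *se;				/* read-mostly, read each tick */
	unsigned long shares;			/* read-mostly, read each tick */
	long load_avg ____cacheline_aligned;	/* hot, written each tick */
};

int main(void)
{
	/* Prints offset 64 and size 128 on x86-64: load_avg starts on its
	 * own cacheline, and a cacheline-aligned allocation keeps it there. */
	printf("load_avg offset = %zu, struct size = %zu\n",
	       offsetof(struct tg_like, load_avg), sizeof(struct tg_like));
	return 0;
}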

By doing so, the perf profile became:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 7 ++++++-
2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..e39204f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
+#endif
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7356,6 +7361,7 @@ void __init sched_init(void)
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);

+ task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+/*
+ * Make sure that the task_group structure is cacheline aligned when
+ * fair group scheduling is enabled.
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_group *alloc_task_group(void)
+{
+ return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+ kmem_cache_free(task_group_cache, tg);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline struct task_group *alloc_task_group(void)
+{
+ return kzalloc(sizeof(struct task_group), GFP_KERNEL);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+ kfree(tg);
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
static void free_sched_group(struct task_group *tg)
{
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kfree(tg);
+ free_task_group(tg);
}

/* allocate runqueue etc for a new task group */
@@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
{
struct task_group *tg;

- tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ tg = alloc_task_group();
if (!tg)
return ERR_PTR(-ENOMEM);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;

#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

--
1.7.1

2015-12-02 18:42:26

by Waiman Long

Subject: [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..4607cb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2670,6 +2670,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

+ /*
+ * No need to update load_avg for root_task_group as it is not used.
+ */
+ if (cfs_rq->tg == &root_task_group)
+ return;
+
if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
--
1.7.1

2015-12-02 19:55:48

by Benjamin Segall

Subject: Re: [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group

Waiman Long <[email protected]> writes:

> Currently, the update_tg_load_avg() function attempts to update the
> tg's load_avg value whenever the load changes even for root_task_group
> where the load_avg value will never be used. This patch will disable
> the load_avg update when the given task group is the root_task_group.
>
> Running a Java benchmark with noautogroup and a 4.3 kernel on a
> 16-socket IvyBridge-EX system, the amount of CPU time (as reported by
> perf) consumed by task_tick_fair() which includes update_tg_load_avg()
> decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
> results also increased slightly from 983015 to 986449.
>
> Signed-off-by: Waiman Long <[email protected]>

Reviewed-by: Ben Segall <[email protected]>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..4607cb7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2670,6 +2670,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> + /*
> + * No need to update load_avg for root_task_group as it is not used.
> + */
> + if (cfs_rq->tg == &root_task_group)
> + return;
> +
> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;

2015-12-02 20:02:37

by Benjamin Segall

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

Waiman Long <[email protected]> writes:

> If a system with a large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
> 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
> 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
> 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
> 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se, which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline-aligned kmem_cache for task_group to make sure
> that all the allocated task_groups are cacheline aligned.
>
> By doing so, the perf profile became:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
> Before patch - Max-jOPs: 907533 Critical-jOps: 134877
> After patch - Max-jOPs: 916011 Critical-jOps: 142366
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 7 ++++++-
> 2 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..e39204f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
> */
> struct task_group root_task_group;
> LIST_HEAD(task_groups);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
> +#endif
> #endif
>
> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
> ptr += nr_cpu_ids * sizeof(void **);
>
> + task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);

The KMEM_CACHE macro suggests adding
____cacheline_aligned_in_smp to the struct definition instead.

> #endif /* CONFIG_FAIR_GROUP_SCHED */
> #ifdef CONFIG_RT_GROUP_SCHED
> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
> /* task_group_lock serializes the addition/removal of task groups */
> static DEFINE_SPINLOCK(task_group_lock);
>
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
> static void free_sched_group(struct task_group *tg)
> {
> free_fair_sched_group(tg);
> free_rt_sched_group(tg);
> autogroup_free(tg);
> - kfree(tg);
> + free_task_group(tg);
> }
>
> /* allocate runqueue etc for a new task group */
> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
> {
> struct task_group *tg;
>
> - tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> + tg = alloc_task_group();
> if (!tg)
> return ERR_PTR(-ENOMEM);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..e679895 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
> unsigned long shares;
>
> #ifdef CONFIG_SMP
> - atomic_long_t load_avg;
> + /*
> + * load_avg can be heavily contended at clock tick time, so put
> + * it in its own cacheline separated from the fields above which
> + * will also be accessed at each tick.
> + */
> + atomic_long_t load_avg ____cacheline_aligned;
> #endif
> #endif

I suppose the question is if it would be better to just move this to
wind up on a separate cacheline without the extra empty space, though it
would likely be more fragile and unclear.

2015-12-03 04:32:32

by Mike Galbraith

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:

> By doing so, the perf profile became:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before.

Is that with the box booted skew_tick=1?

-Mike

2015-12-03 10:56:15

by Peter Zijlstra

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */

I think we can simply always use the kmem_cache; both slab and slub
merge slab caches where appropriate.

2015-12-03 11:12:17

by Peter Zijlstra

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline



I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long <[email protected]>
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se, which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all the allocated task_groups are cacheline aligned.

By doing so, the perf profile became:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366

Cc: Scott J Norton <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yuyang Du <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 10 +++++++---
kernel/sched/sched.h | 7 ++++++-
2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
+ task_group_cache = KMEM_CACHE(task_group, 0);
+
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
-
#endif /* CONFIG_CGROUP_SCHED */

for_each_possible_cpu(i) {
@@ -7687,7 +7691,7 @@ static void free_sched_group(struct task
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kfree(tg);
+ kmem_cache_free(task_group_cache, tg);
}

/* allocate runqueue etc for a new task group */
@@ -7695,7 +7699,7 @@ struct task_group *sched_create_group(st
{
struct task_group *tg;

- tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
if (!tg)
return ERR_PTR(-ENOMEM);

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;

#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

2015-12-03 18:01:52

by Benjamin Segall

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

Peter Zijlstra <[email protected]> writes:

> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long <[email protected]>
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
[...]
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> #ifdef CONFIG_CGROUP_SCHED
> + task_group_cache = KMEM_CACHE(task_group, 0);
> +
> list_add(&root_task_group.list, &task_groups);
> INIT_LIST_HEAD(&root_task_group.children);
> INIT_LIST_HEAD(&root_task_group.siblings);
> autogroup_init(&init_task);
> -
> #endif /* CONFIG_CGROUP_SCHED */
>
> for_each_possible_cpu(i) {
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
> unsigned long shares;
>
> #ifdef CONFIG_SMP
> - atomic_long_t load_avg;
> + /*
> + * load_avg can be heavily contended at clock tick time, so put
> + * it in its own cacheline separated from the fields above which
> + * will also be accessed at each tick.
> + */
> + atomic_long_t load_avg ____cacheline_aligned;
> #endif
> #endif
>

This loses the cacheline-alignment for task_group, is that ok?

2015-12-03 18:17:21

by Peter Zijlstra

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On Thu, Dec 03, 2015 at 09:56:02AM -0800, [email protected] wrote:
> Peter Zijlstra <[email protected]> writes:

> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> > #endif /* CONFIG_RT_GROUP_SCHED */
> >
> > #ifdef CONFIG_CGROUP_SCHED
> > + task_group_cache = KMEM_CACHE(task_group, 0);
> > +
> > list_add(&root_task_group.list, &task_groups);
> > INIT_LIST_HEAD(&root_task_group.children);
> > INIT_LIST_HEAD(&root_task_group.siblings);
> > autogroup_init(&init_task);
> > -
> > #endif /* CONFIG_CGROUP_SCHED */
> >
> > for_each_possible_cpu(i) {
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -248,7 +248,12 @@ struct task_group {
> > unsigned long shares;
> >
> > #ifdef CONFIG_SMP
> > - atomic_long_t load_avg;
> > + /*
> > + * load_avg can be heavily contended at clock tick time, so put
> > + * it in its own cacheline separated from the fields above which
> > + * will also be accessed at each tick.
> > + */
> > + atomic_long_t load_avg ____cacheline_aligned;
> > #endif
> > #endif
> >
>
> This loses the cacheline-alignment for task_group, is that ok?

I'm a bit dense (it's late), can you spell that out? Did you mean me
killing SLAB_HWCACHE_ALIGN? That should not matter because:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)

picks up the alignment explicitly.

And struct task_group having one cacheline-aligned member means that
the alignment of the composite object (the struct proper) must be an
integer multiple of this (typically 1).
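
A quick userspace check of that reasoning (assuming a 64-byte cacheline;
this is obviously not kernel code):

#include <stdio.h>
#include <stddef.h>

#define ____cacheline_aligned __attribute__((__aligned__(64)))

struct tg_like {
	unsigned long shares;
	long load_avg ____cacheline_aligned;	/* one aligned member */
};

int main(void)
{
	/* Prints 64: the single aligned member raises the alignment of the
	 * whole struct, which KMEM_CACHE() then hands to kmem_cache_create()
	 * via __alignof__(). */
	printf("__alignof__(struct tg_like) = %zu\n",
	       (size_t)__alignof__(struct tg_like));
	return 0;
}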

2015-12-03 18:23:06

by Benjamin Segall

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

Peter Zijlstra <[email protected]> writes:

> On Thu, Dec 03, 2015 at 09:56:02AM -0800, [email protected] wrote:
>> Peter Zijlstra <[email protected]> writes:
>
>> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>> > #endif /* CONFIG_RT_GROUP_SCHED */
>> >
>> > #ifdef CONFIG_CGROUP_SCHED
>> > + task_group_cache = KMEM_CACHE(task_group, 0);
>> > +
>> > list_add(&root_task_group.list, &task_groups);
>> > INIT_LIST_HEAD(&root_task_group.children);
>> > INIT_LIST_HEAD(&root_task_group.siblings);
>> > autogroup_init(&init_task);
>> > -
>> > #endif /* CONFIG_CGROUP_SCHED */
>> >
>> > for_each_possible_cpu(i) {
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -248,7 +248,12 @@ struct task_group {
>> > unsigned long shares;
>> >
>> > #ifdef CONFIG_SMP
>> > - atomic_long_t load_avg;
>> > + /*
>> > + * load_avg can be heavily contended at clock tick time, so put
>> > + * it in its own cacheline separated from the fields above which
>> > + * will also be accessed at each tick.
>> > + */
>> > + atomic_long_t load_avg ____cacheline_aligned;
>> > #endif
>> > #endif
>> >
>>
>> This loses the cacheline-alignment for task_group, is that ok?
>
> I'm a bit dense (it's late), can you spell that out? Did you mean me
> killing SLAB_HWCACHE_ALIGN? That should not matter because:
>
> #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
> sizeof(struct __struct), __alignof__(struct __struct),\
> (__flags), NULL)
>
> picks up the alignment explicitly.
>
> And struct task_group having one cacheline-aligned member means that
> the alignment of the composite object (the struct proper) must be an
> integer multiple of this (typically 1).

Ah, yeah, I forgot about this, my fault.

2015-12-03 19:26:11

by Waiman Long

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/02/2015 03:02 PM, [email protected] wrote:
> Waiman Long <[email protected]> writes:
>
>> If a system with a large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>> 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
>> 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contain variables like shares, cfs_rq & se, which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline-aligned kmem_cache for task_group to make sure
>> that all the allocated task_groups are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch were as follows:
>>
>> Before patch - Max-jOPs: 907533 Critical-jOps: 134877
>> After patch - Max-jOPs: 916011 Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 7 ++++++-
>> 2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>> */
>> struct task_group root_task_group;
>> LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>> #endif
>>
>> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>> ptr += nr_cpu_ids * sizeof(void **);
>>
>> + task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests adding
> ____cacheline_aligned_in_smp to the struct definition instead.

The main goal is to have the load_avg placed in a new cacheline
separated from the read-only fields above. That is why I placed
____cacheline_aligned after load_avg. I omitted the in_smp part because
it is in the SMP block already. Putting ____cacheline_aligned_in_smp on
the struct definition won't guarantee the alignment of any particular
field within the structure.

I have done some testing, and having ____cacheline_aligned inside the
structure has the same effect of forcing the whole structure onto a
cacheline-aligned boundary.

>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>> #ifdef CONFIG_RT_GROUP_SCHED
>> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>> /* task_group_lock serializes the addition/removal of task groups */
>> static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>> static void free_sched_group(struct task_group *tg)
>> {
>> free_fair_sched_group(tg);
>> free_rt_sched_group(tg);
>> autogroup_free(tg);
>> - kfree(tg);
>> + free_task_group(tg);
>> }
>>
>> /* allocate runqueue etc for a new task group */
>> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>> {
>> struct task_group *tg;
>>
>> - tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>> + tg = alloc_task_group();
>> if (!tg)
>> return ERR_PTR(-ENOMEM);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index efd3bfc..e679895 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -248,7 +248,12 @@ struct task_group {
>> unsigned long shares;
>>
>> #ifdef CONFIG_SMP
>> - atomic_long_t load_avg;
>> + /*
>> + * load_avg can be heavily contended at clock tick time, so put
>> + * it in its own cacheline separated from the fields above which
>> + * will also be accessed at each tick.
>> + */
>> + atomic_long_t load_avg ____cacheline_aligned;
>> #endif
>> #endif
> I suppose the question is if it would be better to just move this to
> wind up on a separate cacheline without the extra empty space, though it
> would likely be more fragile and unclear.

I have been thinking about that too. The problem is that anything that
sits in the same cacheline as load_avg and has to be accessed at clock
tick time will cause the same contention problem. In the current
layout, the fields after load_avg are the rt stuff as well as some list
head structures and pointers. The rt stuff should be kind of mutually
exclusive of the CFS load_avg in terms of usage. The list head
structures and pointers don't seem to be that frequently accessed. So
it is the right place to start a new cacheline boundary.

Cheers,
Longman

2015-12-03 19:34:51

by Waiman Long

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/02/2015 11:32 PM, Mike Galbraith wrote:
> On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:
>
>> By doing so, the perf profile became:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before.
> Is that with the box booted skew_tick=1?
>
> -Mike
>

I haven't tried that kernel parameter. I will try it to see if it can
improve the situation. BTW, will there be other undesirable side effects
of using this other than the increased power consumption as said in the
kernel-parameters.txt file?

Cheers,
Longman

2015-12-03 19:38:20

by Waiman Long

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/03/2015 05:56 AM, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> I think we can simply always use the kmem_cache; both slab and slub
> merge slab caches where appropriate.

I did that as I was not sure how much overhead the introduction of a
new kmem_cache would bring. It seems like it is not really an issue. So I am
fine with making the kmem_cache change permanent.

Cheers,
Longman

2015-12-03 19:41:08

by Benjamin Segall

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

Waiman Long <[email protected]> writes:

> On 12/02/2015 03:02 PM, [email protected] wrote:
>> Waiman Long <[email protected]> writes:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index efd3bfc..e679895 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -248,7 +248,12 @@ struct task_group {
>>> unsigned long shares;
>>>
>>> #ifdef CONFIG_SMP
>>> - atomic_long_t load_avg;
>>> + /*
>>> + * load_avg can be heavily contended at clock tick time, so put
>>> + * it in its own cacheline separated from the fields above which
>>> + * will also be accessed at each tick.
>>> + */
>>> + atomic_long_t load_avg ____cacheline_aligned;
>>> #endif
>>> #endif
>> I suppose the question is if it would be better to just move this to
>> wind up on a separate cacheline without the extra empty space, though it
>> would likely be more fragile and unclear.
>
> I have been thinking about that too. The problem is that anything that sits in
> the same cacheline as load_avg and has to be accessed at clock tick time will
> cause the same contention problem. In the current layout, the fields after
> load_avg are the rt stuff as well as some list head structures and pointers.
> The rt stuff should be kind of mutually exclusive of the CFS load_avg in terms
> of usage. The list head structures and pointers don't seem to be that
> frequently accessed. So it is the right place to start a new cacheline boundary.
>
> Cheers,
> Longman

Yeah, this is a good place to start a new boundary; I was just saying
you could probably remove the empty space by reordering fields, but that
would be a less logical ordering in terms of programmer clarity.

2015-12-03 19:56:42

by Waiman Long

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/03/2015 06:12 AM, Peter Zijlstra wrote:
>
> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long <[email protected]>
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
> If a system with a large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
> 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
> 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
> 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
> 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se, which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline-aligned kmem_cache for task_group to make sure
> that all the allocated task_groups are cacheline aligned.
>
> By doing so, the perf profile became:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
> Before patch - Max-jOPs: 907533 Critical-jOps: 134877
> After patch - Max-jOPs: 916011 Critical-jOps: 142366
>
> Cc: Scott J Norton <[email protected]>
> Cc: Douglas Hatch <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Yuyang Du <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> kernel/sched/core.c | 10 +++++++---
> kernel/sched/sched.h | 7 ++++++-
> 2 files changed, 13 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
> */
> struct task_group root_task_group;
> LIST_HEAD(task_groups);
> +
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
> #endif
>
> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> #ifdef CONFIG_CGROUP_SCHED
> + task_group_cache = KMEM_CACHE(task_group, 0);
> +
Thanks for making that change.

Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is
defined. Other than that, I am fine with the change.

Cheers,
Longman

2015-12-03 20:03:31

by Peter Zijlstra

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On Thu, Dec 03, 2015 at 02:56:37PM -0500, Waiman Long wrote:
> > #ifdef CONFIG_CGROUP_SCHED
> >+ task_group_cache = KMEM_CACHE(task_group, 0);
> >+
> Thanks for making that change.
>
> Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
> flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
> Other than that, I am fine with the change.

I don't think we need that; see my earlier reply to Ben.

2015-12-04 02:08:02

by Mike Galbraith

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
> On 12/02/2015 11:32 PM, Mike Galbraith wrote:

> > Is that with the box booted skew_tick=1?

> I haven't tried that kernel parameter. I will try it to see if it can
> improve the situation. BTW, will there be other undesirable side effects
> of using this other than the increased power consumption as said in the
> kernel-parameters.txt file?

None that are known. I kinda doubt you'd notice the power, but you
should see a notable performance boost. Who knows, with a big enough
farm of busy big boxen, it may save power by needing fewer of them.

-Mike

Subject: [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline

Commit-ID: b0367629acf62a78404c467cd09df447c2fea804
Gitweb: http://git.kernel.org/tip/b0367629acf62a78404c467cd09df447c2fea804
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 2 Dec 2015 13:41:49 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 4 Dec 2015 10:34:48 +0100

sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se, which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all the allocated task_groups are cacheline aligned.

By doing so, the perf profile became:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yuyang Du <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 10 +++++++---
kernel/sched/sched.h | 7 ++++++-
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d591db1..aa3f978 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7370,6 +7370,9 @@ int in_sched_functions(unsigned long addr)
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7427,11 +7430,12 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
+ task_group_cache = KMEM_CACHE(task_group, 0);
+
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
-
#endif /* CONFIG_CGROUP_SCHED */

for_each_possible_cpu(i) {
@@ -7712,7 +7716,7 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kfree(tg);
+ kmem_cache_free(task_group_cache, tg);
}

/* allocate runqueue etc for a new task group */
@@ -7720,7 +7724,7 @@ struct task_group *sched_create_group(struct task_group *parent)
{
struct task_group *tg;

- tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
if (!tg)
return ERR_PTR(-ENOMEM);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 472cd14..a5a6b3e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;

#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

Subject: [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group

Commit-ID: aa0b7ae06387d40a988ce16a189082dee6e570bc
Gitweb: http://git.kernel.org/tip/aa0b7ae06387d40a988ce16a189082dee6e570bc
Author: Waiman Long <[email protected]>
AuthorDate: Wed, 2 Dec 2015 13:41:50 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 4 Dec 2015 10:34:49 +0100

sched/fair: Disable the task group load_avg update for the root_task_group

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ben Segall <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yuyang Du <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b0e8b8..1093873 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2709,6 +2709,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

+ /*
+ * No need to update load_avg for root_task_group as it is not used.
+ */
+ if (cfs_rq->tg == &root_task_group)
+ return;
+
if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;

2015-12-04 20:19:25

by Waiman Long

Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/03/2015 09:07 PM, Mike Galbraith wrote:
> On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
>> On 12/02/2015 11:32 PM, Mike Galbraith wrote:
>>> Is that with the box booted skew_tick=1?
>> I haven't tried that kernel parameter. I will try it to see if it can
>> improve the situation. BTW, will there be other undesirable side effects
>> of using this other than the increased power consumption as said in the
>> kernel-parameters.txt file?
> None that are known. I kinda doubt you'd notice the power, but you
> should see a notable performance boost. Who knows, with a big enough
> farm of busy big boxen, it may save power by needing fewer of them.
>
> -Mike
>

You are right. Using skew_tick=1 did improve performance and reduced the
overhead of clock tick processing rather significantly. I think it
should be the default for large SMP boxes. Thanks for the tip.
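
For anyone who wants to try it, a typical way to set the parameter
persistently (assuming a GRUB2-based boot setup; the exact file and
regeneration command vary by distro) is:

# /etc/default/grub: append skew_tick=1 to the kernel command line
GRUB_CMDLINE_LINUX="... skew_tick=1"

# then regenerate the configuration and reboot, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg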

Cheers,
Longman