LinuxLists.cc - [PATCH v2] sched: reduce contention on tg's load_avg & runnable

2014-01-16 02:23:23

Subject: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

It was found that with a perf profile of a compute workload (at 1500
users) of the AIM7 benchmark running on a glueless 4-socket 40-core
Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
tick related functions account for quite a significant portion of
the total kernel cpu cycles.

0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
0.47% reaim [kernel.kallsyms] [k] entity_tick
0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
0.03% reaim [kernel.kallsyms] [k] update_curr

The scheduling tick functions account for about 1.22% of the total
CPU cycles. Of the top 2 function in the above list, the reading
and writing of the tg->load_avg variable account for over 90% of the
CPU cycles:

atomic_long_add(tg_contrib, &tg->load_avg);
atomic_long_read(&tg->load_avg) + 1);

This patch reduces the contention on the load_avg variable (and
secondarily on the runnable_avg variable) by the following 2 measures:

1. Make the load_avg and runnable_avg fields of the task_group
structure sit in their own cacheline without sharing it with others.
This only applies if the kernel is built for NUMA systems with
multiple sockets.

2. Use atomic_long_add_return() to update the fields and save the
returned value in a temporary location in the cfs structure to
be used later instead of reading the fields directly.

The second change does require some changes in the ordering of how
some of the average counts are being computed and hence may have a
slight effect on their behavior.

With these 2 changes, the perf profile becomes:

0.42% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
0.05% reaim [kernel.kallsyms] [k] update_cfs_shares
0.04% reaim [kernel.kallsyms] [k] update_curr
0.04% reaim [kernel.kallsyms] [k] entity_tick

The %CPU cycle is reduced to about 0.55%. It is not a big change,
but it did improve the compute benchmark slightly from 398509 JPM
(Jobs/Minute) to 405803 JPM which is about 2% improvement and reduced
the reported systime from 50.03s to 48.37s.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/sched/fair.c | 29 ++++++++++++++++++++++-------
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7395d9..c4aa86d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
* to gain a more accurate current total weight. See
* update_cfs_rq_load_contribution().
*/
- tg_weight = atomic_long_read(&tg->load_avg);
+ /* Use the saved version of tg's load_avg, if available */
+ tg_weight = cfs_rq->tg_load_save;
+ if (!tg_weight)
+ tg_weight = atomic_long_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_contrib;
tg_weight += cfs_rq->load.weight;

@@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
tg_contrib -= cfs_rq->tg_load_contrib;

if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
- atomic_long_add(tg_contrib, &tg->load_avg);
+ cfs_rq->tg_load_save =
+ atomic_long_add_return(tg_contrib, &tg->load_avg);
cfs_rq->tg_load_contrib += tg_contrib;
}
}
@@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
contrib -= cfs_rq->tg_runnable_contrib;

if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
- atomic_add(contrib, &tg->runnable_avg);
+ cfs_rq->tg_runnable_save =
+ atomic_add_return(contrib, &tg->runnable_avg);
cfs_rq->tg_runnable_contrib += contrib;
}
}
@@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
struct cfs_rq *cfs_rq = group_cfs_rq(se);
struct task_group *tg = cfs_rq->tg;
int runnable_avg;
+ long load_avg;

u64 contrib;

contrib = cfs_rq->tg_load_contrib * tg->shares;
- se->avg.load_avg_contrib = div_u64(contrib,
- atomic_long_read(&tg->load_avg) + 1);
+ /*
+ * Retrieve & clear the saved tg's load_avg and use it if not 0
+ */
+ load_avg = cfs_rq->tg_load_save;
+ cfs_rq->tg_load_save = 0;
+ if (unlikely(!load_avg))
+ load_avg = atomic_long_read(&tg->load_avg);
+ se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);

/*
* For group entities we need to compute a correction term in the case
@@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
* of consequential size guaranteed to see n_i*w_i quickly converge to
* our upper bound of 1-cpu.
*/
- runnable_avg = atomic_read(&tg->runnable_avg);
+ runnable_avg = cfs_rq->tg_runnable_save;
+ cfs_rq->tg_runnable_save = 0;
+ if (unlikely(!runnable_avg))
+ runnable_avg = atomic_read(&tg->runnable_avg);
if (runnable_avg < NICE_0_LOAD) {
se->avg.load_avg_contrib *= runnable_avg;
se->avg.load_avg_contrib >>= NICE_0_SHIFT;
@@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
/*
* Ensure that runnable average is periodically updated.
*/
- update_entity_load_avg(curr, 1);
update_cfs_rq_blocked_load(cfs_rq, 1);
update_cfs_shares(cfs_rq);
+ update_entity_load_avg(curr, 1);

#ifdef CONFIG_SCHED_HRTICK
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88c85b2..f425630 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -11,6 +11,14 @@
#include "cpupri.h"
#include "cpuacct.h"

+#ifndef ____cacheline_aligned_in_numa
+#ifdef CONFIG_NUMA
+#define ____cacheline_aligned_in_numa ____cacheline_aligned
+#else
+#define ____cacheline_aligned_in_numa
+#endif
+#endif
+
struct rq;

extern __read_mostly int scheduler_running;
@@ -150,8 +158,8 @@ struct task_group {
unsigned long shares;

#ifdef CONFIG_SMP
- atomic_long_t load_avg;
- atomic_t runnable_avg;
+ atomic_long_t load_avg ____cacheline_aligned_in_numa;
+ atomic_t runnable_avg ____cacheline_aligned_in_numa;
#endif
#endif

@@ -285,7 +293,9 @@ struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
+ int tg_runnable_save;
unsigned long tg_load_contrib;
+ long tg_load_save;

/*
* h_load = weight * f(tg)
--
1.7.1

2014-01-16 12:45:28

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

First of.. WTF is v1?

Secondly, please always CC the authors of the code you're changing.

On Wed, Jan 15, 2014 at 09:22:36PM -0500, Waiman Long wrote:
> It was found that with a perf profile of a compute workload (at 1500
> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
> tick related functions account for quite a significant portion of
> the total kernel cpu cycles.
>
> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
> 0.47% reaim [kernel.kallsyms] [k] entity_tick
> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
> 0.03% reaim [kernel.kallsyms] [k] update_curr
>
> The scheduling tick functions account for about 1.22% of the total
> CPU cycles. Of the top 2 function in the above list, the reading
> and writing of the tg->load_avg variable account for over 90% of the
> CPU cycles:
>
> atomic_long_add(tg_contrib, &tg->load_avg);
> atomic_long_read(&tg->load_avg) + 1);
>
> This patch reduces the contention on the load_avg variable (and
> secondarily on the runnable_avg variable) by the following 2 measures:
>
> 1. Make the load_avg and runnable_avg fields of the task_group
> structure sit in their own cacheline without sharing it with others.
> This only applies if the kernel is built for NUMA systems with
> multiple sockets.

So why not for SMP?

Also, what's the difference between having both of them in the same
cacheline as opposed to a cacheline each?

They're both touched from the same tick, so it makes sense to have them
in one cacheline. Now you get to move two lines into exclusive state,
instead of just the one.

> 2. Use atomic_long_add_return() to update the fields and save the
> returned value in a temporary location in the cfs structure to
> be used later instead of reading the fields directly.

Then why aren't this two patches?

Furthermore, I completely hate the way you implemented this; the stuff
like in the first hunk below makes the entire code flow horrid. Its
already difficult code, using conditional variables makes it even worse.

Who's to say your 'cached' value is recent? You didn't put in a call
chain analysis to show you always first pass through the add_return()
before using the cached value.

> The second change does require some changes in the ordering of how
> some of the average counts are being computed and hence may have a
> slight effect on their behavior.

Might have is no good, either you work through it and make damn sure its
solid or you walk.

Preserved the rest for the added Cc's.

> With these 2 changes, the perf profile becomes:
>
> 0.42% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
> 0.05% reaim [kernel.kallsyms] [k] update_cfs_shares
> 0.04% reaim [kernel.kallsyms] [k] update_curr
> 0.04% reaim [kernel.kallsyms] [k] entity_tick
>
> The %CPU cycle is reduced to about 0.55%. It is not a big change,
> but it did improve the compute benchmark slightly from 398509 JPM
> (Jobs/Minute) to 405803 JPM which is about 2% improvement and reduced
> the reported systime from 50.03s to 48.37s.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/sched/fair.c | 29 ++++++++++++++++++++++-------
> kernel/sched/sched.h | 14 ++++++++++++--
> 2 files changed, 34 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c7395d9..c4aa86d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
> * to gain a more accurate current total weight. See
> * update_cfs_rq_load_contribution().
> */
> - tg_weight = atomic_long_read(&tg->load_avg);
> + /* Use the saved version of tg's load_avg, if available */
> + tg_weight = cfs_rq->tg_load_save;
> + if (!tg_weight)
> + tg_weight = atomic_long_read(&tg->load_avg);
> tg_weight -= cfs_rq->tg_load_contrib;
> tg_weight += cfs_rq->load.weight;
>
> @@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> tg_contrib -= cfs_rq->tg_load_contrib;
>
> if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> - atomic_long_add(tg_contrib, &tg->load_avg);
> + cfs_rq->tg_load_save =
> + atomic_long_add_return(tg_contrib, &tg->load_avg);
> cfs_rq->tg_load_contrib += tg_contrib;
> }
> }
> @@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> contrib -= cfs_rq->tg_runnable_contrib;
>
> if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> - atomic_add(contrib, &tg->runnable_avg);
> + cfs_rq->tg_runnable_save =
> + atomic_add_return(contrib, &tg->runnable_avg);
> cfs_rq->tg_runnable_contrib += contrib;
> }
> }
> @@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
> struct cfs_rq *cfs_rq = group_cfs_rq(se);
> struct task_group *tg = cfs_rq->tg;
> int runnable_avg;
> + long load_avg;
>
> u64 contrib;
>
> contrib = cfs_rq->tg_load_contrib * tg->shares;
> - se->avg.load_avg_contrib = div_u64(contrib,
> - atomic_long_read(&tg->load_avg) + 1);
> + /*
> + * Retrieve & clear the saved tg's load_avg and use it if not 0
> + */
> + load_avg = cfs_rq->tg_load_save;
> + cfs_rq->tg_load_save = 0;
> + if (unlikely(!load_avg))
> + load_avg = atomic_long_read(&tg->load_avg);
> + se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);
>
> /*
> * For group entities we need to compute a correction term in the case
> @@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
> * of consequential size guaranteed to see n_i*w_i quickly converge to
> * our upper bound of 1-cpu.
> */
> - runnable_avg = atomic_read(&tg->runnable_avg);
> + runnable_avg = cfs_rq->tg_runnable_save;
> + cfs_rq->tg_runnable_save = 0;
> + if (unlikely(!runnable_avg))
> + runnable_avg = atomic_read(&tg->runnable_avg);
> if (runnable_avg < NICE_0_LOAD) {
> se->avg.load_avg_contrib *= runnable_avg;
> se->avg.load_avg_contrib >>= NICE_0_SHIFT;
> @@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> /*
> * Ensure that runnable average is periodically updated.
> */
> - update_entity_load_avg(curr, 1);
> update_cfs_rq_blocked_load(cfs_rq, 1);
> update_cfs_shares(cfs_rq);
> + update_entity_load_avg(curr, 1);
>
> #ifdef CONFIG_SCHED_HRTICK
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 88c85b2..f425630 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -11,6 +11,14 @@
> #include "cpupri.h"
> #include "cpuacct.h"
>
> +#ifndef ____cacheline_aligned_in_numa
> +#ifdef CONFIG_NUMA
> +#define ____cacheline_aligned_in_numa ____cacheline_aligned
> +#else
> +#define ____cacheline_aligned_in_numa
> +#endif
> +#endif
> +
> struct rq;
>
> extern __read_mostly int scheduler_running;
> @@ -150,8 +158,8 @@ struct task_group {
> unsigned long shares;
>
> #ifdef CONFIG_SMP
> - atomic_long_t load_avg;
> - atomic_t runnable_avg;
> + atomic_long_t load_avg ____cacheline_aligned_in_numa;
> + atomic_t runnable_avg ____cacheline_aligned_in_numa;
> #endif
> #endif
>
> @@ -285,7 +293,9 @@ struct cfs_rq {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* Required to track per-cpu representation of a task_group */
> u32 tg_runnable_contrib;
> + int tg_runnable_save;
> unsigned long tg_load_contrib;
> + long tg_load_save;
>
> /*
> * h_load = weight * f(tg)
> --
> 1.7.1
>

2014-01-16 18:27:34

by Benjamin Segall

[permalink] [raw]

Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

Waiman Long <[email protected]> writes:

> It was found that with a perf profile of a compute workload (at 1500
> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
> tick related functions account for quite a significant portion of
> the total kernel cpu cycles.
>
> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
> 0.47% reaim [kernel.kallsyms] [k] entity_tick
> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
> 0.03% reaim [kernel.kallsyms] [k] update_curr
>
> The scheduling tick functions account for about 1.22% of the total
> CPU cycles. Of the top 2 function in the above list, the reading
> and writing of the tg->load_avg variable account for over 90% of the
> CPU cycles:
>
> atomic_long_add(tg_contrib, &tg->load_avg);
> atomic_long_read(&tg->load_avg) + 1);
>
> This patch reduces the contention on the load_avg variable (and
> secondarily on the runnable_avg variable) by the following 2 measures:
>
> 1. Make the load_avg and runnable_avg fields of the task_group
> structure sit in their own cacheline without sharing it with others.
> This only applies if the kernel is built for NUMA systems with
> multiple sockets.

How much of the benefit comes from this (and how much for load_avg vs
runnable_avg vs just one separate cache_line for the pair)?
>
> 2. Use atomic_long_add_return() to update the fields and save the
> returned value in a temporary location in the cfs structure to
> be used later instead of reading the fields directly.
>

This is safe for tg->runnable_avg, as it only lasts for one line of
__update_entity_load_avg_contrib, and is never used for rq->cfs. That
said, given that it is such a short and contained duration it seems
simpler to just pass it around in __update_entity_load_avg_contrib
rather than make a new field on cfs_rq.

> The second change does require some changes in the ordering of how
> some of the average counts are being computed and hence may have a
> slight effect on their behavior.
>
> With these 2 changes, the perf profile becomes:
>
> 0.42% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
> 0.05% reaim [kernel.kallsyms] [k] update_cfs_shares
> 0.04% reaim [kernel.kallsyms] [k] update_curr
> 0.04% reaim [kernel.kallsyms] [k] entity_tick
>
> The %CPU cycle is reduced to about 0.55%. It is not a big change,
> but it did improve the compute benchmark slightly from 398509 JPM
> (Jobs/Minute) to 405803 JPM which is about 2% improvement and reduced
> the reported systime from 50.03s to 48.37s.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/sched/fair.c | 29 ++++++++++++++++++++++-------
> kernel/sched/sched.h | 14 ++++++++++++--
> 2 files changed, 34 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c7395d9..c4aa86d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
> * to gain a more accurate current total weight. See
> * update_cfs_rq_load_contribution().
> */
> - tg_weight = atomic_long_read(&tg->load_avg);
> + /* Use the saved version of tg's load_avg, if available */
> + tg_weight = cfs_rq->tg_load_save;
> + if (!tg_weight)
> + tg_weight = atomic_long_read(&tg->load_avg);
> tg_weight -= cfs_rq->tg_load_contrib;
> tg_weight += cfs_rq->load.weight;
>
> @@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> tg_contrib -= cfs_rq->tg_load_contrib;
>
> if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> - atomic_long_add(tg_contrib, &tg->load_avg);
> + cfs_rq->tg_load_save =
> + atomic_long_add_return(tg_contrib, &tg->load_avg);
> cfs_rq->tg_load_contrib += tg_contrib;
> }
> }
> @@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> contrib -= cfs_rq->tg_runnable_contrib;
>
> if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> - atomic_add(contrib, &tg->runnable_avg);
> + cfs_rq->tg_runnable_save =
> + atomic_add_return(contrib, &tg->runnable_avg);
> cfs_rq->tg_runnable_contrib += contrib;
> }
> }
> @@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
> struct cfs_rq *cfs_rq = group_cfs_rq(se);
> struct task_group *tg = cfs_rq->tg;
> int runnable_avg;
> + long load_avg;
>
> u64 contrib;
>
> contrib = cfs_rq->tg_load_contrib * tg->shares;
> - se->avg.load_avg_contrib = div_u64(contrib,
> - atomic_long_read(&tg->load_avg) + 1);
> + /*
> + * Retrieve & clear the saved tg's load_avg and use it if not 0
> + */
> + load_avg = cfs_rq->tg_load_save;
> + cfs_rq->tg_load_save = 0;
> + if (unlikely(!load_avg))
> + load_avg = atomic_long_read(&tg->load_avg);
> + se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);
>
> /*
> * For group entities we need to compute a correction term in the case
> @@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
> * of consequential size guaranteed to see n_i*w_i quickly converge to
> * our upper bound of 1-cpu.
> */
> - runnable_avg = atomic_read(&tg->runnable_avg);
> + runnable_avg = cfs_rq->tg_runnable_save;
> + cfs_rq->tg_runnable_save = 0;
> + if (unlikely(!runnable_avg))
> + runnable_avg = atomic_read(&tg->runnable_avg);
> if (runnable_avg < NICE_0_LOAD) {
> se->avg.load_avg_contrib *= runnable_avg;
> se->avg.load_avg_contrib >>= NICE_0_SHIFT;
> @@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> /*
> * Ensure that runnable average is periodically updated.
> */
> - update_entity_load_avg(curr, 1);
> update_cfs_rq_blocked_load(cfs_rq, 1);
> update_cfs_shares(cfs_rq);
> + update_entity_load_avg(curr, 1);
You've confused group_cfs_rq(curr) and cfs_rq=cfs_rq_of(curr) here -
there is no need to do this accuracy-reducing reordering.
update_cfs_rq_blocked_load would set cfs_rq->tg_load_save, and then
entity_tick(curr->parent) called this same tick would read this value,
the same way enqueue/dequeue will do what you wanted.

That said, there is still a problem that tg_load_save could escape in
cases where __update_entity_load_avg_contrib gets skipped, either via
__update_entity_load_avg_contrib not crossing a boundary or
enqueue/dequeue aborting early due to cfs_rq_throttled. Worst case
should be accessing a value ~1ms old though, which might be acceptable.

2014-01-22 16:25:28

by Waiman Long

[permalink] [raw]

Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

On 01/16/2014 07:44 AM, Peter Zijlstra wrote:
> First of.. WTF is v1?
>
> Secondly, please always CC the authors of the code you're changing.

The v1 patch was sent quite a while ago on 9/21/2013. See
https://lkml.org/lkml/2013/9/23/551

There was no feedback at that time. As this was not a high priority
patch for me, I didn't follow up at that time. I will include the change
log in the next version.

> On Wed, Jan 15, 2014 at 09:22:36PM -0500, Waiman Long wrote:
>> It was found that with a perf profile of a compute workload (at 1500
>> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
>> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
>> tick related functions account for quite a significant portion of
>> the total kernel cpu cycles.
>>
>> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
>> 0.47% reaim [kernel.kallsyms] [k] entity_tick
>> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
>> 0.03% reaim [kernel.kallsyms] [k] update_curr
>>
>> The scheduling tick functions account for about 1.22% of the total
>> CPU cycles. Of the top 2 function in the above list, the reading
>> and writing of the tg->load_avg variable account for over 90% of the
>> CPU cycles:
>>
>> atomic_long_add(tg_contrib,&tg->load_avg);
>> atomic_long_read(&tg->load_avg) + 1);
>>
>> This patch reduces the contention on the load_avg variable (and
>> secondarily on the runnable_avg variable) by the following 2 measures:
>>
>> 1. Make the load_avg and runnable_avg fields of the task_group
>> structure sit in their own cacheline without sharing it with others.
>> This only applies if the kernel is built for NUMA systems with
>> multiple sockets.
> So why not for SMP?

The cache coherency traffic is generally not a problem for single-socket
multi-core system, that is why I currently increase the data structure
size only for kernel built for multi-socket systems. Of course, I can
also enable it for SMP system in general.

> Also, what's the difference between having both of them in the same
> cacheline as opposed to a cacheline each?
> They're both touched from the same tick, so it makes sense to have them
> in one cacheline. Now you get to move two lines into exclusive state,
> instead of just the one.

Below is the performance data for different cacheline placements:

Cacheline Placement | %CPU | JPM |
---------------------+-------+--------+
2 separate cachelines| 0.55% | 405803 |
1 common cacheline | 1.01% | 403462 |
2nd change only | 1.06% | 403820 |
Original code | 1.22% | 398509 |

It seems like forcing the 2 fields to be in the same cacheline actually
make it perform a little bit worse. It is likely that the 2 fields just
happen to be in 2 different cachelines in x86.

>> 2. Use atomic_long_add_return() to update the fields and save the
>> returned value in a temporary location in the cfs structure to
>> be used later instead of reading the fields directly.
> Then why aren't this two patches?

I will break it into 2 patches.

> Furthermore, I completely hate the way you implemented this; the stuff
> like in the first hunk below makes the entire code flow horrid. Its
> already difficult code, using conditional variables makes it even worse.

I can try to encapsulate the change in macros to not disrupt the current
flow.

> Who's to say your 'cached' value is recent? You didn't put in a call
> chain analysis to show you always first pass through the add_return()
> before using the cached value.

Will provide a more detailed call chain analysis to show when and how
the cache value is used.

>> The second change does require some changes in the ordering of how
>> some of the average counts are being computed and hence may have a
>> slight effect on their behavior.
> Might have is no good, either you work through it and make damn sure its
> solid or you walk.

I will do a more detailed analysis and provide that in the change log.

> Preserved the rest for the added Cc's.
>

Will do.

-Longman

2014-01-22 17:14:07

by Waiman Long

[permalink] [raw]

Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

On 01/16/2014 01:21 PM, [email protected] wrote:
> Waiman Long<[email protected]> writes:
>
>> It was found that with a perf profile of a compute workload (at 1500
>> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
>> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
>> tick related functions account for quite a significant portion of
>> the total kernel cpu cycles.
>>
>> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
>> 0.47% reaim [kernel.kallsyms] [k] entity_tick
>> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
>> 0.03% reaim [kernel.kallsyms] [k] update_curr
>>
>> The scheduling tick functions account for about 1.22% of the total
>> CPU cycles. Of the top 2 function in the above list, the reading
>> and writing of the tg->load_avg variable account for over 90% of the
>> CPU cycles:
>>
>> atomic_long_add(tg_contrib,&tg->load_avg);
>> atomic_long_read(&tg->load_avg) + 1);
>>
>> This patch reduces the contention on the load_avg variable (and
>> secondarily on the runnable_avg variable) by the following 2 measures:
>>
>> 1. Make the load_avg and runnable_avg fields of the task_group
>> structure sit in their own cacheline without sharing it with others.
>> This only applies if the kernel is built for NUMA systems with
>> multiple sockets.
> How much of the benefit comes from this (and how much for load_avg vs
> runnable_avg vs just one separate cache_line for the pair)?

Below are the performance data for different cacheline placement:

Cacheline Placement | %CPU | JPM |
---------------------+-------+--------+
2 separate cachelines| 0.55% | 405803 |
1 common cacheline | 1.01% | 403462 |
2nd change only | 1.06% | 403820 |
Original code | 1.22% | 398509 |

It seems like forcing the 2 fields to be in the same cacheline actually
make it perform a little bit worse. It is likely that the 2 fields were
actually in 2 different cacheline in x86.

>> 2. Use atomic_long_add_return() to update the fields and save the
>> returned value in a temporary location in the cfs structure to
>> be used later instead of reading the fields directly.
>>
> This is safe for tg->runnable_avg, as it only lasts for one line of
> __update_entity_load_avg_contrib, and is never used for rq->cfs. That
> said, given that it is such a short and contained duration it seems
> simpler to just pass it around in __update_entity_load_avg_contrib
> rather than make a new field on cfs_rq.

Thank for the suggestion, I will look into that.

>> The second change does require some changes in the ordering of how
>> some of the average counts are being computed and hence may have a
>> slight effect on their behavior.
>>
>> With these 2 changes, the perf profile becomes:
>>
>> 0.42% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
>> 0.05% reaim [kernel.kallsyms] [k] update_cfs_shares
>> 0.04% reaim [kernel.kallsyms] [k] update_curr
>> 0.04% reaim [kernel.kallsyms] [k] entity_tick
>>
>> The %CPU cycle is reduced to about 0.55%. It is not a big change,
>> but it did improve the compute benchmark slightly from 398509 JPM
>> (Jobs/Minute) to 405803 JPM which is about 2% improvement and reduced
>> the reported systime from 50.03s to 48.37s.
>>
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> kernel/sched/fair.c | 29 ++++++++++++++++++++++-------
>> kernel/sched/sched.h | 14 ++++++++++++--
>> 2 files changed, 34 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c7395d9..c4aa86d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>> * to gain a more accurate current total weight. See
>> * update_cfs_rq_load_contribution().
>> */
>> - tg_weight = atomic_long_read(&tg->load_avg);
>> + /* Use the saved version of tg's load_avg, if available */
>> + tg_weight = cfs_rq->tg_load_save;
>> + if (!tg_weight)
>> + tg_weight = atomic_long_read(&tg->load_avg);
>> tg_weight -= cfs_rq->tg_load_contrib;
>> tg_weight += cfs_rq->load.weight;
>>
>> @@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>> tg_contrib -= cfs_rq->tg_load_contrib;
>>
>> if (force_update || abs(tg_contrib)> cfs_rq->tg_load_contrib / 8) {
>> - atomic_long_add(tg_contrib,&tg->load_avg);
>> + cfs_rq->tg_load_save =
>> + atomic_long_add_return(tg_contrib,&tg->load_avg);
>> cfs_rq->tg_load_contrib += tg_contrib;
>> }
>> }
>> @@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>> contrib -= cfs_rq->tg_runnable_contrib;
>>
>> if (abs(contrib)> cfs_rq->tg_runnable_contrib / 64) {
>> - atomic_add(contrib,&tg->runnable_avg);
>> + cfs_rq->tg_runnable_save =
>> + atomic_add_return(contrib,&tg->runnable_avg);
>> cfs_rq->tg_runnable_contrib += contrib;
>> }
>> }
>> @@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> struct task_group *tg = cfs_rq->tg;
>> int runnable_avg;
>> + long load_avg;
>>
>> u64 contrib;
>>
>> contrib = cfs_rq->tg_load_contrib * tg->shares;
>> - se->avg.load_avg_contrib = div_u64(contrib,
>> - atomic_long_read(&tg->load_avg) + 1);
>> + /*
>> + * Retrieve& clear the saved tg's load_avg and use it if not 0
>> + */
>> + load_avg = cfs_rq->tg_load_save;
>> + cfs_rq->tg_load_save = 0;
>> + if (unlikely(!load_avg))
>> + load_avg = atomic_long_read(&tg->load_avg);
>> + se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);
>>
>> /*
>> * For group entities we need to compute a correction term in the case
>> @@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> * of consequential size guaranteed to see n_i*w_i quickly converge to
>> * our upper bound of 1-cpu.
>> */
>> - runnable_avg = atomic_read(&tg->runnable_avg);
>> + runnable_avg = cfs_rq->tg_runnable_save;
>> + cfs_rq->tg_runnable_save = 0;
>> + if (unlikely(!runnable_avg))
>> + runnable_avg = atomic_read(&tg->runnable_avg);
>> if (runnable_avg< NICE_0_LOAD) {
>> se->avg.load_avg_contrib *= runnable_avg;
>> se->avg.load_avg_contrib>>= NICE_0_SHIFT;
>> @@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> /*
>> * Ensure that runnable average is periodically updated.
>> */
>> - update_entity_load_avg(curr, 1);
>> update_cfs_rq_blocked_load(cfs_rq, 1);
>> update_cfs_shares(cfs_rq);
>> + update_entity_load_avg(curr, 1);
> You've confused group_cfs_rq(curr) and cfs_rq=cfs_rq_of(curr) here -
> there is no need to do this accuracy-reducing reordering.
> update_cfs_rq_blocked_load would set cfs_rq->tg_load_save, and then
> entity_tick(curr->parent) called this same tick would read this value,
> the same way enqueue/dequeue will do what you wanted.

I will try to do it without reordering calls here.

> That said, there is still a problem that tg_load_save could escape in
> cases where __update_entity_load_avg_contrib gets skipped, either via
> __update_entity_load_avg_contrib not crossing a boundary or
> enqueue/dequeue aborting early due to cfs_rq_throttled. Worst case
> should be accessing a value ~1ms old though, which might be acceptable.

Will provide a more detailed analysis of all possible cases in the next
version of the patch.

-Longman