From: Waiman Long <Waiman.Long@hpe.com>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Scott J Norton <scott.norton@hpe.com>,
        Douglas Hatch <doug.hatch@hpe.com>, Waiman Long <Waiman.Long@hpe.com>
Subject: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
Date: Wed, 25 Nov 2015 14:09:40 -0500
Message-Id: <1448478580-26467-4-git-send-email-Waiman.Long@hpe.com>
In-Reply-To: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>
References: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6252
Lines: 159

The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system,
contention on the load_avg cacheline can cause a significant
slowdown.

This patch attempts to separate the load_avg readers
(update_cfs_shares) from the writers (task_tick_fair) so that they use
different cachelines. Writers of load_avg now accumulate the load
delta into load_avg_delta, which sits in a different cacheline. Once
the magnitude of load_avg_delta is sufficiently large (> load_avg/64),
it is added back to load_avg.
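
The accumulate-then-flush scheme described above can be illustrated
with a small userspace sketch. It is not the kernel code: it assumes
C11 atomics in place of atomic_long_t, an assumed 64-byte cacheline in
place of ____cacheline_aligned, and the names tg_sketch, tg_add_delta
and CACHELINE are hypothetical, chosen only for this illustration.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdlib.h>

#define CACHELINE 64	/* assumed cacheline size, for illustration only */

/* Hypothetical stand-in for the relevant fields of struct task_group. */
struct tg_sketch {
	int use_la_delta;
	/* Read-mostly when use_la_delta is set. */
	alignas(CACHELINE) atomic_long load_avg;
	/* Write-mostly accumulator, kept on its own cacheline. */
	alignas(CACHELINE) atomic_long load_avg_delta;
};

/* Fold one per-CPU load delta into the group-wide counters. */
static void tg_add_delta(struct tg_sketch *tg, long delta)
{
	long tot_delta, load_avg;

	if (!tg->use_la_delta) {
		/* Flag clear: update load_avg directly, as before. */
		atomic_fetch_add(&tg->load_avg, delta);
		return;
	}

	/* Accumulate into the write-mostly counter first. */
	tot_delta = atomic_fetch_add(&tg->load_avg_delta, delta) + delta;
	load_avg = atomic_load(&tg->load_avg);

	/* Flush only once the accumulated delta exceeds load_avg/64. */
	if (labs(tot_delta) > load_avg / 64) {
		/* Claim whatever has accumulated; others may have added more. */
		tot_delta = atomic_exchange(&tg->load_avg_delta, 0);
		if (tot_delta)
			atomic_fetch_add(&tg->load_avg, tot_delta);
	}
}
```

In this sketch, with load_avg at 6400 the flush threshold is 100, so a
delta of 50 stays in load_avg_delta and a following delta of 60 pushes
the accumulated 110 into load_avg in a single update.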

Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:

   9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares

After the patch, it became:

   2.94%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   2.52%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   2.25%   0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
   2.21%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   1.70%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   0.96%   0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
   0.61%   0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares

The benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 916011    Critical-jOps: 142366
  After patch  - Max-jOPs: 939130    Critical-jOps: 211937

Max-jOPs improved by about 2.5%. Critical-jOps, which is latency
sensitive, improved by about 49%.

This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated on systems with more than two possible NUMA
nodes (num_possible_nodes() > 2), i.e. typically 4-socket or larger
systems, which can get the most benefit from it.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/core.c  |    9 +++++++++
 kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    8 ++++++++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
 		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
 
+#ifdef CONFIG_SMP
+		/*
+		 * Use load_avg_delta if not 2P or less
+		 */
+		root_task_group.use_la_delta = (num_possible_nodes() > 2);
+#endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	tg->use_la_delta = root_task_group.use_la_delta;
+#endif
 	return tg;
 
 err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
  * and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		struct task_group *tg = cfs_rq->tg;
+		long load_avg, tot_delta;
+
+		if (!tg->use_la_delta) {
+			/*
+			 * If the use_la_delta isn't set, just add the
+			 * delta directly into load_avg.
+			 */
+			atomic_long_add(delta, &tg->load_avg);
+			goto set_contrib;
+		}
+
+		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
+		load_avg = atomic_long_read(&tg->load_avg);
+		if (abs(tot_delta) > load_avg / 64) {
+			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+			if (tot_delta)
+				atomic_long_add(tot_delta, &tg->load_avg);
+		}
+set_contrib:
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
 	 * load_avg can be heavily contended at clock tick time, so put
 	 * it in its own cacheline separated from the fields above which
 	 * will also be accessed at each tick.
+	 *
+	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
+	 * to accumulate the delta and only change load_avg when the delta
+	 * is big enough. This reduces the cacheline contention on load_avg.
+	 * This flag will be set at allocation time depending on the system
+	 * configuration.
 	 */
+	int use_la_delta;
 	atomic_long_t load_avg ____cacheline_aligned;
+	atomic_long_t load_avg_delta ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1
