From: Waiman Long
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Waiman Long
Subject: [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline
Date: Wed, 25 Nov 2015 14:09:39 -0500
Message-Id: <1448478580-26467-3-git-send-email-Waiman.Long@hpe.com>
X-Mailer: git-send-email 1.7.1
In-Reply-To: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>
References: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
was enabled, as it is in most distributions. Running a Java benchmark
on a 16-socket IvyBridge-EX system, the perf profile looked like this:

  10.52%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java  [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java  [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares() was
mostly due to contention on the cacheline that contains the
task_group's load_avg statistical counter. This cacheline may also
contain variables such as shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into its own cacheline,
separated from the other frequently accessed variables. By doing so,
the perf profile became:

   9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares

The %CPU time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long
---
 kernel/sched/sched.h |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;

 #ifdef CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
--
1.7.1
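
A minimal user-space sketch of the same cacheline-separation technique,
assuming a 64-byte cache line and using C11 alignas() in place of the
kernel's ____cacheline_aligned macro; the fake_task_group struct and its
field names are made up purely for illustration:

/*
 * Illustration only: separate a hot, frequently written counter from
 * read-mostly fields.  alignas(64) plays the role of the kernel's
 * ____cacheline_aligned; 64 bytes is an assumed cache line size.
 */
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

struct fake_task_group {
	unsigned long shares;		/* read on every tick */
	void *cfs_rq;			/* also read at tick time */
	void *se;
	/* written frequently; push it onto its own cache line */
	alignas(64) long load_avg;
};

int main(void)
{
	printf("shares   offset: %zu\n", offsetof(struct fake_task_group, shares));
	printf("load_avg offset: %zu\n", offsetof(struct fake_task_group, load_avg));
	printf("struct size    : %zu\n", sizeof(struct fake_task_group));
	return 0;
}

On a typical x86-64 build this should report load_avg at offset 64,
i.e. on a different cache line from the read-mostly fields, so a CPU
updating load_avg no longer invalidates the line holding shares,
cfs_rq and se on other CPUs.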