Date: Wed, 22 Jan 2014 12:13:40 -0500
From: Waiman Long
To: bsegall@google.com
CC: Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org,
    "Paul E. McKenney", Frederic Weisbecker, "Eric W. Biederman",
    Andrew Morton, Serge Hallyn, Aswin Chandramouleeswaran, Scott J Norton
Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg
Message-ID: <52DFFC44.4030104@hp.com>
References: <1389838956-56574-1-git-send-email-Waiman.Long@hp.com>

On 01/16/2014 01:21 PM, bsegall@google.com wrote:
> Waiman Long writes:
>
>> It was found from a perf profile of a compute workload (at 1500
>> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
>> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
>> tick related functions account for quite a significant portion of
>> the total kernel CPU cycles.
>>
>>   0.62%  reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>>   0.47%  reaim  [kernel.kallsyms]  [k] entity_tick
>>   0.10%  reaim  [kernel.kallsyms]  [k] update_cfs_shares
>>   0.03%  reaim  [kernel.kallsyms]  [k] update_curr
>>
>> The scheduling tick functions account for about 1.22% of the total
>> CPU cycles. Of the top 2 functions in the above list, the reading
>> and writing of the tg->load_avg variable account for over 90% of the
>> CPU cycles:
>>
>>   atomic_long_add(tg_contrib, &tg->load_avg);
>>   atomic_long_read(&tg->load_avg) + 1);
>>
>> This patch reduces the contention on the load_avg variable (and
>> secondarily on the runnable_avg variable) by the following 2 measures:
>>
>> 1. Make the load_avg and runnable_avg fields of the task_group
>> structure sit in their own cacheline without sharing it with others.
>> This only applies if the kernel is built for NUMA systems with
>> multiple sockets.
>
> How much of the benefit comes from this (and how much for load_avg vs
> runnable_avg vs just one separate cache_line for the pair)?

Below are the performance data for the different cacheline placements:

  Cacheline Placement   |  %CPU |   JPM  |
  ----------------------+-------+--------+
  2 separate cachelines | 0.55% | 405803 |
  1 common cacheline    | 1.01% | 403462 |
  2nd change only       | 1.06% | 403820 |
  Original code         | 1.22% | 398509 |

It seems like forcing the 2 fields to be in the same cacheline actually
makes it perform a little bit worse. It is likely that the 2 fields were
already in 2 different cachelines on x86.
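(As an aside, for anyone reading along: the first measure essentially amounts
to pushing each of the two hot counters in struct task_group onto its own
cacheline boundary, for example with the standard ____cacheline_aligned_in_smp
annotation. The snippet below is only a sketch of the idea, not the actual
kernel/sched/sched.h hunk from the patch, and the CONFIG guard is chosen for
illustration only:)

struct task_group {
	/* ... other members ... */
#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_NUMA)
	/*
	 * Align each hot counter to its own cacheline boundary so that
	 * cross-socket updates to one do not bounce the line holding the
	 * other. (A member following the pair would also need alignment
	 * or padding to fully isolate runnable_avg.)
	 */
	atomic_long_t load_avg ____cacheline_aligned_in_smp;
	atomic_t runnable_avg ____cacheline_aligned_in_smp;
#else
	atomic_long_t load_avg;
	atomic_t runnable_avg;
#endif
	/* ... other members ... */
};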
>> 2. Use atomic_long_add_return() to update the fields and save the
>> returned value in a temporary location in the cfs structure to
>> be used later instead of reading the fields directly.
>
> This is safe for tg->runnable_avg, as it only lasts for one line of
> __update_entity_load_avg_contrib, and is never used for rq->cfs. That
> said, given that it is such a short and contained duration it seems
> simpler to just pass it around in __update_entity_load_avg_contrib
> rather than make a new field on cfs_rq.

Thanks for the suggestion, I will look into that.
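Something along these lines, presumably (a rough sketch of the suggested
alternative for the runnable_avg side, not code from the posted patch; the
div_u64() line paraphrases the existing 3.13 helper):

static inline int __update_tg_runnable_avg(struct sched_avg *sa,
					   struct cfs_rq *cfs_rq)
{
	struct task_group *tg = cfs_rq->tg;
	long contrib;

	/* The fraction of a cpu used by this cfs_rq, computed as before */
	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
			  sa->runnable_avg_period + 1);
	contrib -= cfs_rq->tg_runnable_contrib;

	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
		cfs_rq->tg_runnable_contrib += contrib;
		/* hand the freshly updated sum straight back to the caller */
		return atomic_add_return(contrib, &tg->runnable_avg);
	}
	return atomic_read(&tg->runnable_avg);
}

__update_group_entity_contrib() would then take the returned value as an
extra argument and use it in place of atomic_read(&tg->runnable_avg), with
__update_entity_load_avg_contrib() simply passing one into the other:

	__update_group_entity_contrib(se,
		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se)));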
>> The second change does require some changes in the ordering of how
>> some of the average counts are being computed and hence may have a
>> slight effect on their behavior.
>>
>> With these 2 changes, the perf profile becomes:
>>
>>   0.42%  reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>>   0.05%  reaim  [kernel.kallsyms]  [k] update_cfs_shares
>>   0.04%  reaim  [kernel.kallsyms]  [k] update_curr
>>   0.04%  reaim  [kernel.kallsyms]  [k] entity_tick
>>
>> The %CPU cycles are reduced to about 0.55%. It is not a big change,
>> but it did improve the compute benchmark slightly from 398509 JPM
>> (Jobs/Minute) to 405803 JPM, which is about a 2% improvement, and
>> reduced the reported systime from 50.03s to 48.37s.
>>
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/fair.c  | 29 ++++++++++++++++++++++-------
>>  kernel/sched/sched.h | 14 ++++++++++++--
>>  2 files changed, 34 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c7395d9..c4aa86d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>>   	 * to gain a more accurate current total weight. See
>>   	 * update_cfs_rq_load_contribution().
>>   	 */
>> -	tg_weight = atomic_long_read(&tg->load_avg);
>> +	/* Use the saved version of tg's load_avg, if available */
>> +	tg_weight = cfs_rq->tg_load_save;
>> +	if (!tg_weight)
>> +		tg_weight = atomic_long_read(&tg->load_avg);
>>   	tg_weight -= cfs_rq->tg_load_contrib;
>>   	tg_weight += cfs_rq->load.weight;
>>
>> @@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>>   	tg_contrib -= cfs_rq->tg_load_contrib;
>>
>>   	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>> -		atomic_long_add(tg_contrib, &tg->load_avg);
>> +		cfs_rq->tg_load_save =
>> +			atomic_long_add_return(tg_contrib, &tg->load_avg);
>>   		cfs_rq->tg_load_contrib += tg_contrib;
>>   	}
>>   }
>> @@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>>   	contrib -= cfs_rq->tg_runnable_contrib;
>>
>>   	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
>> -		atomic_add(contrib, &tg->runnable_avg);
>> +		cfs_rq->tg_runnable_save =
>> +			atomic_add_return(contrib, &tg->runnable_avg);
>>   		cfs_rq->tg_runnable_contrib += contrib;
>>   	}
>>   }
>> @@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>>   	struct cfs_rq *cfs_rq = group_cfs_rq(se);
>>   	struct task_group *tg = cfs_rq->tg;
>>   	int runnable_avg;
>> +	long load_avg;
>>
>>   	u64 contrib;
>>
>>   	contrib = cfs_rq->tg_load_contrib * tg->shares;
>> -	se->avg.load_avg_contrib = div_u64(contrib,
>> -				atomic_long_read(&tg->load_avg) + 1);
>> +	/*
>> +	 * Retrieve & clear the saved tg's load_avg and use it if not 0
>> +	 */
>> +	load_avg = cfs_rq->tg_load_save;
>> +	cfs_rq->tg_load_save = 0;
>> +	if (unlikely(!load_avg))
>> +		load_avg = atomic_long_read(&tg->load_avg);
>> +	se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);
>>
>>   	/*
>>   	 * For group entities we need to compute a correction term in the case
>> @@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>>   	 * of consequential size guaranteed to see n_i*w_i quickly converge to
>>   	 * our upper bound of 1-cpu.
>>   	 */
>> -	runnable_avg = atomic_read(&tg->runnable_avg);
>> +	runnable_avg = cfs_rq->tg_runnable_save;
>> +	cfs_rq->tg_runnable_save = 0;
>> +	if (unlikely(!runnable_avg))
>> +		runnable_avg = atomic_read(&tg->runnable_avg);
>>   	if (runnable_avg < NICE_0_LOAD) {
>>   		se->avg.load_avg_contrib *= runnable_avg;
>>   		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
>> @@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>   	/*
>>   	 * Ensure that runnable average is periodically updated.
>>   	 */
>> -	update_entity_load_avg(curr, 1);
>>   	update_cfs_rq_blocked_load(cfs_rq, 1);
>>   	update_cfs_shares(cfs_rq);
>> +	update_entity_load_avg(curr, 1);
>
> You've confused group_cfs_rq(curr) and cfs_rq=cfs_rq_of(curr) here -
> there is no need to do this accuracy-reducing reordering.
> update_cfs_rq_blocked_load would set cfs_rq->tg_load_save, and then
> entity_tick(curr->parent) called this same tick would read this value,
> the same way enqueue/dequeue will do what you wanted.

I will try to do it without reordering the calls here.

> That said, there is still a problem that tg_load_save could escape in
> cases where __update_entity_load_avg_contrib gets skipped, either via
> __update_entity_load_avg_contrib not crossing a boundary or
> enqueue/dequeue aborting early due to cfs_rq_throttled. Worst case
> should be accessing a value ~1ms old though, which might be acceptable.

I will provide a more detailed analysis of all the possible cases in the
next version of the patch.

-Longman