Date: Mon, 30 Nov 2015 14:13:32 -0500
From: Waiman Long
To: Peter Zijlstra
CC: Ingo Molnar, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch,
    Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
Message-ID: <565C9FDC.9020603@hpe.com>
References: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>
 <1448478580-26467-4-git-send-email-Waiman.Long@hpe.com>
 <20151130102240.GH17308@twins.programming.kicks-ass.net>
In-Reply-To: <20151130102240.GH17308@twins.programming.kicks-ass.net>

On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the
>> load delta into load_avg_delta, which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (> load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>>   9.44%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>   8.74%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>   7.83%  0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>   7.74%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>   7.27%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>   5.94%  1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>>   4.15%  3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>>   2.94%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>   2.52%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>   2.25%  0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>   2.21%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>   1.70%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>   0.96%  0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>>   0.61%  0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares
>
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.
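If the concern is only the root group, I suppose the tick-time update
could simply be skipped for it. An untested sketch of that idea (not
part of this patch) would be something like:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

        /*
         * The root task group doesn't normalize its weight through
         * calc_tg_weight(), so its load_avg wouldn't need to be kept
         * up to date at all.
         */
        if (cfs_rq->tg == &root_task_group)
                return;

        if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
        }
}

That would take care of the autogroup/root case, but the cacheline
contention would still be there for workloads that do put tasks into
non-root task groups.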
>> The benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>>   After patch  - Max-jOPs: 939130    Critical-jOps: 211937
>>
>> There was significant improvement in Critical-jOps which was latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems
>> which can get the most benefit from it.
>
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.

>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/core.c  |    9 +++++++++
>>  kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |    8 ++++++++
>>  3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>>                  root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>                  ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> +                /*
>> +                 * Use load_avg_delta if not 2P or less
>> +                 */
>> +                root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>                  root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>>          if (!alloc_rt_sched_group(tg, parent))
>>                  goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> +        tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>>          return tg;
>>
>>  err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>>   * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>>   */
>>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>>  {
>>          long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>>          if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> -                atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> +                struct task_group *tg = cfs_rq->tg;
>> +                long load_avg, tot_delta;
>> +
>> +                if (!tg->use_la_delta) {
>> +                        /*
>> +                         * If the use_la_delta isn't set, just add the
>> +                         * delta directly into load_avg.
>> +                         */
>> +                        atomic_long_add(delta, &tg->load_avg);
>> +                        goto set_contrib;
>> +                }
>> +
>> +                tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> +                load_avg = atomic_long_read(&tg->load_avg);
>> +                if (abs(tot_delta) > load_avg / 64) {
>> +                        tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> +                        if (tot_delta)
>> +                                atomic_long_add(tot_delta, &tg->load_avg);
>> +                }
>> +set_contrib:
>>                  cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>          }
>>  }
>
> I'm thinking that it's now far too big to retain the inline qualifier.

I can take the inline keyword out.

>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>>           * load_avg can be heavily contended at clock tick time, so put
>>           * it in its own cacheline separated from the fields above which
>>           * will also be accessed at each tick.
>> +         *
>> +         * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> +         * to accumulate the delta and only change load_avg when the delta
>> +         * is big enough. This reduces the cacheline contention on load_avg.
>> +         * This flag will be set at allocation time depending on the system
>> +         * configuration.
>>           */
>> +        int use_la_delta;
>>          atomic_long_t load_avg ____cacheline_aligned;
>> +        atomic_long_t load_avg_delta ____cacheline_aligned;
>
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increased the size to 512 bytes, so it did make the structure a multiple
of the cacheline size. With both slub and slab, the task group pointers
returned by kzalloc() in sched_create_group() were all multiples of
0x200, so they were properly aligned for the ____cacheline_aligned tag
to work.
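If relying on the allocator happening to hand back 512-byte aligned
objects is considered too fragile, the alignment could also be made
explicit by giving task_group its own slab cache. An untested sketch
(not part of this patch):

/*
 * Dedicated cache so the allocator honours __alignof__(struct task_group),
 * which becomes a full cacheline once a ____cacheline_aligned member is
 * present.
 */
static struct kmem_cache *task_group_cache __read_mostly;

/* In sched_init(), before any task group can be created: */
task_group_cache = KMEM_CACHE(task_group, 0);

/* In sched_create_group(), replacing kzalloc(sizeof(*tg), GFP_KERNEL): */
tg = kmem_cache_zalloc(task_group_cache, GFP_KERNEL);

KMEM_CACHE() passes __alignof__(struct task_group) to kmem_cache_create(),
so the cacheline alignment would then be guaranteed rather than being a
side effect of the object size.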
> Also, you unconditionally grow the structure by a whole cacheline.

I know that is a drawback of using the ____cacheline_aligned tag.
However, we probably won't create too many task groups in normal use,
so the increase in memory consumption shouldn't be noticeable.

Cheers,
Longman