Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933962AbdC3N77 (ORCPT ); Thu, 30 Mar 2017 09:59:59 -0400 Received: from mail-wr0-f195.google.com ([209.85.128.195]:36368 "EHLO mail-wr0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933474AbdC3N75 (ORCPT ); Thu, 30 Mar 2017 09:59:57 -0400 MIME-Version: 1.0 In-Reply-To: <20170330133802.GC3626@lerouge> References: <20170323165512.60945ac6@redhat.com> <1490636129.8850.76.camel@redhat.com> <20170328132406.7d23579c@redhat.com> <20170329131656.1d6cb743@redhat.com> <1490818125.28917.11.camel@redhat.com> <1490848051.4167.57.camel@gmx.de> <20170330133802.GC3626@lerouge> From: Wanpeng Li Date: Thu, 30 Mar 2017 21:59:54 +0800 Message-ID: Subject: Re: [BUG nohz]: wrong user and system time accounting To: Frederic Weisbecker Cc: Mike Galbraith , Rik van Riel , Luiz Capitulino , "linux-kernel@vger.kernel.org" , Peter Zijlstra , Thomas Gleixner Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4472 Lines: 108 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: >> Cc Peterz, Thomas, >> 2017-03-30 12:27 GMT+08:00 Mike Galbraith : >> > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: >> > >> >> In other words, the tick on cpu0 is aligned >> >> with the tick on the nohz_full cpus, and >> >> jiffies is advanced while the nohz_full cpus >> >> with an active tick happen to be in kernel >> >> mode? >> > >> > You really want skew_tick=1, especially on big boxen. >> > >> >> Frederic, can you think of any reason why >> >> the tick on nohz_full CPUs would end up aligned >> >> with the tick on cpu0, instead of running at some >> >> random offset? >> > >> > (I or low rq->clock bits as crude NOHZ collision avoidance) >> > >> >> A random offset, or better yet a somewhat randomized >> >> tick length to make sure that simultaneous ticks are >> >> fairly rare and the vtime sampling does not end up >> >> "in phase" with the jiffies incrementing, could make >> >> the accounting work right again. >> > >> > That improves jitter, especially on big boxen. I have an 8 socket box >> > that thinks it's an extra large PC, there, collision avoidance matters >> > hugely. I couldn't reproduce bean counting woes, no idea if collision >> > avoidance will help that. >> >> So I implement two methods, one is from Rik's random offset proposal >> through skew tick, the other one is from Frederic's proposal and it is >> the same as my original idea through use nanosecond granularity to >> check deltas but only perform an actual cputime update when that delta >> >= TICK_NSEC. Both methods can solve the bug which Luiz reported. >> Peterz, Thomas, any ideas? >> >> --------------------------->8------------------------------------------------------------- >> >> skew tick: >> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >> index 7fe53be..9981437 100644 >> --- a/kernel/time/tick-sched.c >> +++ b/kernel/time/tick-sched.c >> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void) >> hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); >> >> /* Offset the tick to avert jiffies_lock contention. */ >> +#ifdef CONFIG_NO_HZ_FULL >> + if (sched_skew_tick || tick_nohz_full_running) { >> +#else >> if (sched_skew_tick) { >> +#endif > > Please rather use tick_nohz_full_enabled() to avoid ifdeffery. > >> u64 offset = ktime_to_ns(tick_period) >> 1; >> do_div(offset, num_possible_cpus()); >> offset *= smp_processor_id(); > > If it works, we may want to take that solution, likely less performance sensitive > than using sched_clock(). In fact sched_clock() is fast, especially as we require it to > be stable for nohz_full, but using it involves costly conversion back and forth to jiffies. So both Rik and you agree with the skew tick solution, I will try it tomorrow. Btw, if we should just add random offset to the cpu in the nohz_full mode or add random offset to all cpus like the codes above? Regards, Wanpeng Li > >> >> -------------------------------------->8----------------------------------------------------- >> >> use nanosecond granularity to check deltas but only perform an actual >> cputime update when that delta >= TICK_NSEC. >> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> index f3778e2b..f1ee393 100644 >> --- a/kernel/sched/cputime.c >> +++ b/kernel/sched/cputime.c >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct >> task_struct *p, u64 *ut, u64 *st) >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >> static u64 vtime_delta(struct task_struct *tsk) >> { >> - unsigned long now = READ_ONCE(jiffies); >> + u64 now = local_clock(); > > I fear we need a global clock, because the reader (task_cputime()) needs > to compute the delta and therefore use the same clock from any CPU. > > Or we can use the local_clock() but the reader must access the same. > > So there would be vtime_delta_writer() which uses local_clock and stores > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which > is protected by vtime_seqcount as well. > > Although those sched_clock_cpu() things seem to only matter when the > sched_clock() is unstable. And that stability is a condition for nohz_full > to work anyway. So probably sched_clock() alone would be enough. > > Thanks.