Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934406AbdDGQ6M (ORCPT ); Fri, 7 Apr 2017 12:58:12 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45590 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933435AbdDGQ6F (ORCPT ); Fri, 7 Apr 2017 12:58:05 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 7B8938E3EA Authentication-Results: ext-mx01.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx01.extmail.prod.ext.phx2.redhat.com; spf=pass smtp.mailfrom=lcapitulino@redhat.com DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 7B8938E3EA Date: Fri, 7 Apr 2017 12:58:01 -0400 From: Luiz Capitulino To: Wanpeng Li Cc: linux-kernel@vger.kernel.org, Ingo Molnar , Peter Zijlstra , Wanpeng Li , Frederic Weisbecker , Rik van Riel , Mike Galbraith , Thomas Gleixner Subject: Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Message-ID: <20170407125801.503ae098@redhat.com> In-Reply-To: <1491460584-4908-1-git-send-email-wanpeng.li@hotmail.com> References: <1491460584-4908-1-git-send-email-wanpeng.li@hotmail.com> Organization: Red Hat MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Fri, 07 Apr 2017 16:58:04 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5339 Lines: 109 On Wed, 5 Apr 2017 23:36:24 -0700 Wanpeng Li wrote: > From: Wanpeng Li > > This can be reproduced by setting up a CPU for nohz_full and pin two tasks > that hog the CPU 100% of the time to that CPU, top reports 70% system time > and 30% user time utilization sometimes. The trace like this: > > hog-11980 [015] 341.494491: function: enter_from_user_mode <-- apic_timer_interrupt > -0 [000] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt > hog-11980 [015] 341.494492: function: __context_tracking_exit <-- enter_from_user_mode > -0 [000] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt > hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339) > hog-11980 [015] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt > hog-11980 [015] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt > hog-11980 [015] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues > -0 [000] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues > -0 [000] 341.494493: function: tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer > -0 [000] 341.494494: function: do_timer <-- tick_do_update_jiffies64.part.14 > hog-11980 [015] 341.494494: function: irq_exit <-- smp_apic_timer_interrupt > -0 [000] 341.494494: bprint: do_timer: updated jiffies_64=4295008340 ticks=1 > hog-11980 [015] 341.494494: function: __context_tracking_enter <-- prepare_exit_to_usermode > hog-11980 [015] 341.494494: function: vtime_user_enter <-- __context_tracking_enter > hog-11980 [015] 341.494495: bprint: vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339) > hog-11980 [015] 341.494495: function: __vtime_account_system <-- vtime_user_enter > hog-11980 [015] 341.494495: bprint: get_vtime_delta: vtime_snap=4295008339 now=4295008340 > hog-11980 [015] 341.494495: function: account_system_time <-- __vtime_account_system > hog-11980 [015] 341.494495: bprint: account_system_time: cputime=995488 > -0 [000] 341.494497: function: irq_exit <-- smp_apic_timer_interrupt > > In this trace, we see the following: > > 1. On CPU15, we transition from user-space to kernel-space because > of a timer interrupt (it's the tick) > > 2. vtimer_delta() returns 0, because jiffies didn't change since the > last accounting > > 3. While CPU15 is executing in kernel-space, jiffies is updated > by CPU0 > > 4. When going back to user-space, vtime_delta() returns non-zero > and the whole time is accounted for system time (observe how > the cputime parameter in account_system_time() is less than 1ms) > > > After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN > to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to > jiffy based sampling. So the issue can happen if all CPUs fire their ticks > at the same time: > > > CPU 0 CPU 1 > ----- ----- > exit_user() // no cputime update > tick X update_jiffies > enter_user() // cputime update > > > exit_user() // no cputime update > tick X+1 update_jiffies > enter_user() // cputime update > > The nohz_full CPU receives an interrupt at the same time the timer > interrupt fires on the housekeeping CPU. > > This patch offsets the tick to avert all ticks alignment in order > that the vtime sampling does not end up "in phase" with the jiffies > incrementing. > > Reported-by: Luiz Capitulino > Suggested-by: Rik van Riel > Cc: Frederic Weisbecker > Cc: Rik van Riel > Cc: Mike Galbraith > Cc: Luiz Capitulino > Cc: Thomas Gleixner > Cc: Peter Zijlstra > Signed-off-by: Wanpeng Li Reviewed-and-Tested-by: Luiz Capitulino Thanks for everyone who participated on tracking this down and fixing it! > --- > kernel/time/tick-sched.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..35d79f7 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void) > /* Get the next period (per-CPU) */ > hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); > > - /* Offset the tick to avert jiffies_lock contention. */ > - if (sched_skew_tick) { > + /* > + * Offset the tick to avert jiffies_lock contention, and all ticks > + * alignment in order that the vtime sampling does not end up "in > + * phase" with the jiffies incrementing. > + */ > + if (sched_skew_tick || tick_nohz_full_enabled()) { > u64 offset = ktime_to_ns(tick_period) >> 1; > do_div(offset, num_possible_cpus()); > offset *= smp_processor_id();