Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932662AbdDFGgo (ORCPT ); Thu, 6 Apr 2017 02:36:44 -0400 Received: from mail-pg0-f66.google.com ([74.125.83.66]:33886 "EHLO mail-pg0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752913AbdDFGgf (ORCPT ); Thu, 6 Apr 2017 02:36:35 -0400 From: Wanpeng Li X-Google-Original-From: Wanpeng Li To: linux-kernel@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Wanpeng Li , Frederic Weisbecker , Rik van Riel , Mike Galbraith , Luiz Capitulino , Thomas Gleixner Subject: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Date: Wed, 5 Apr 2017 23:36:24 -0700 Message-Id: <1491460584-4908-1-git-send-email-wanpeng.li@hotmail.com> X-Mailer: git-send-email 2.7.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4931 Lines: 102 From: Wanpeng Li This can be reproduced by setting up a CPU for nohz_full and pin two tasks that hog the CPU 100% of the time to that CPU, top reports 70% system time and 30% user time utilization sometimes. The trace like this: hog-11980 [015] 341.494491: function: enter_from_user_mode <-- apic_timer_interrupt -0 [000] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt hog-11980 [015] 341.494492: function: __context_tracking_exit <-- enter_from_user_mode -0 [000] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339) hog-11980 [015] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt hog-11980 [015] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt hog-11980 [015] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues -0 [000] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues -0 [000] 341.494493: function: tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer -0 [000] 341.494494: function: do_timer <-- tick_do_update_jiffies64.part.14 hog-11980 [015] 341.494494: function: irq_exit <-- smp_apic_timer_interrupt -0 [000] 341.494494: bprint: do_timer: updated jiffies_64=4295008340 ticks=1 hog-11980 [015] 341.494494: function: __context_tracking_enter <-- prepare_exit_to_usermode hog-11980 [015] 341.494494: function: vtime_user_enter <-- __context_tracking_enter hog-11980 [015] 341.494495: bprint: vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339) hog-11980 [015] 341.494495: function: __vtime_account_system <-- vtime_user_enter hog-11980 [015] 341.494495: bprint: get_vtime_delta: vtime_snap=4295008339 now=4295008340 hog-11980 [015] 341.494495: function: account_system_time <-- __vtime_account_system hog-11980 [015] 341.494495: bprint: account_system_time: cputime=995488 -0 [000] 341.494497: function: irq_exit <-- smp_apic_timer_interrupt In this trace, we see the following: 1. On CPU15, we transition from user-space to kernel-space because of a timer interrupt (it's the tick) 2. vtimer_delta() returns 0, because jiffies didn't change since the last accounting 3. While CPU15 is executing in kernel-space, jiffies is updated by CPU0 4. When going back to user-space, vtime_delta() returns non-zero and the whole time is accounted for system time (observe how the cputime parameter in account_system_time() is less than 1ms) After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to jiffy based sampling. So the issue can happen if all CPUs fire their ticks at the same time: CPU 0 CPU 1 ----- ----- exit_user() // no cputime update tick X update_jiffies enter_user() // cputime update exit_user() // no cputime update tick X+1 update_jiffies enter_user() // cputime update The nohz_full CPU receives an interrupt at the same time the timer interrupt fires on the housekeeping CPU. This patch offsets the tick to avert all ticks alignment in order that the vtime sampling does not end up "in phase" with the jiffies incrementing. Reported-by: Luiz Capitulino Suggested-by: Rik van Riel Cc: Frederic Weisbecker Cc: Rik van Riel Cc: Mike Galbraith Cc: Luiz Capitulino Cc: Thomas Gleixner Cc: Peter Zijlstra Signed-off-by: Wanpeng Li --- kernel/time/tick-sched.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 7fe53be..35d79f7 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void) /* Get the next period (per-CPU) */ hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); - /* Offset the tick to avert jiffies_lock contention. */ - if (sched_skew_tick) { + /* + * Offset the tick to avert jiffies_lock contention, and all ticks + * alignment in order that the vtime sampling does not end up "in + * phase" with the jiffies incrementing. + */ + if (sched_skew_tick || tick_nohz_full_enabled()) { u64 offset = ktime_to_ns(tick_period) >> 1; do_div(offset, num_possible_cpus()); offset *= smp_processor_id(); -- 2.7.4