Date: Thu, 30 Mar 2017 00:54:30 +0200
From: Frederic Weisbecker
To: Rik van Riel
Cc: Luiz Capitulino, Wanpeng Li, linux-kernel@vger.kernel.org, Thomas Gleixner
Subject: Re: [BUG nohz]: wrong user and system time accounting
Message-ID: <20170329225428.GC23895@lerouge>
References: <20170323165512.60945ac6@redhat.com> <1490636129.8850.76.camel@redhat.com> <20170328132406.7d23579c@redhat.com> <20170329131656.1d6cb743@redhat.com> <1490818125.28917.11.camel@redhat.com>
In-Reply-To: <1490818125.28917.11.camel@redhat.com>

(Adding Thomas in Cc)

On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote:
> On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
> > On Tue, 28 Mar 2017 13:24:06 -0400
> > Luiz Capitulino wrote:
> >
> > >  1. In my tracing I'm seeing that sometimes (always?) the
> > >     time interval between two timer interrupts is less than 1ms
> >
> > I think that's the root cause.
> >
> > In this trace, we see the following:
> >
> >  1. On CPU15, we transition from user-space to kernel-space because
> >     of a timer interrupt (it's the tick)
> >
> >  2. vtime_delta() returns 0, because jiffies didn't change since the
> >     last accounting
> >
> >  3. While CPU15 is executing in kernel-space, jiffies is updated
> >     by CPU0
> >
> >  4. When going back to user-space, vtime_delta() returns non-zero
> >     and the whole time is accounted for system time (observe how
> >     the cputime parameter in account_system_time() is less than 1ms)
>
> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?

Ah, you found out faster than me :-)

> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?

tick_init_jiffy_update() makes that decision to align all ticks.

I'm not sure why. I don't see anything that could depend on that wide
tick synchronization. The jiffies update itself relies on ktime to check
when to update it. So even if the tick fires a bit later on CPU 1 than on
CPU 0, the jiffies updates should stay coherent and should never exceed
a 999us delay in the worst case (for HZ=1000).

Now I might be overlooking something.

> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.
>
> Of course, that assumes the above hypothesis is correct :)

I'm not sure that randomizing the tick start per CPU would be the right
solution. Somewhere in the world you can be sure the tick randomization
of some nohz_full CPU will coincide with the tick of CPU 0 :o)

Or we could force the tick on nohz_full CPUs to be far from CPU 0's
tick... I'm not sure such a solution would be accepted though.
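
To make the four-step sequence quoted above concrete, here is a toy model of
jiffies-granular vtime accounting. It is only a sketch under stated
assumptions, not kernel code: jiffies is assumed to advance at exact 1000us
boundaries (HZ=1000), the nohz_full CPU is assumed to spend a fixed 10us in
the kernel per tick, and jiffies_at(), simulate() and phase_us are
hypothetical names made up for this illustration.

```python
TICK_US = 1000  # length of one jiffy in microseconds, assuming HZ=1000


def jiffies_at(t_us):
    """Hypothetical global jiffies value at time t_us (updated by CPU 0)."""
    return t_us // TICK_US


def simulate(phase_us, kernel_us=10, ticks=1000):
    """Account time on a nohz_full CPU whose tick fires phase_us after
    each jiffies boundary and which stays kernel_us in the kernel.

    Returns (user_jiffies, system_jiffies) as the accounting sees them.
    """
    user = system = 0
    snapshot = jiffies_at(0)  # jiffies value at the last accounting
    for k in range(ticks):
        entry = k * TICK_US + phase_us  # user -> kernel (the tick)
        leave = entry + kernel_us       # kernel -> user

        # Leaving user mode: charge whole elapsed jiffies to user time.
        delta = jiffies_at(entry) - snapshot
        user += delta
        snapshot += delta

        # Leaving kernel mode: charge whole elapsed jiffies to system time.
        delta = jiffies_at(leave) - snapshot
        system += delta
        snapshot += delta
    return user, system


if __name__ == "__main__":
    # Tick lands 5us before the jiffies update: jiffies always changes
    # while the CPU is in the kernel, so everything is charged to system.
    print(simulate(phase_us=995))  # (0, 1000)
    # Tick lands mid-jiffy: jiffies changes while the CPU is in user mode.
    print(simulate(phase_us=500))  # (999, 0)
```

In this model the CPU spends 990us of every 1000us in user space in both
runs, yet with the tick in phase with the jiffies update every jiffy is
charged to system time. The same model also shows why a fixed offset only
helps if it keeps the jiffies update outside the kernel-mode window.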