Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753267AbdDMNdG (ORCPT ); Thu, 13 Apr 2017 09:33:06 -0400 Received: from mail-wm0-f65.google.com ([74.125.82.65]:35339 "EHLO mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751054AbdDMNdC (ORCPT ); Thu, 13 Apr 2017 09:33:02 -0400 Date: Thu, 13 Apr 2017 15:32:57 +0200 From: Frederic Weisbecker To: Wanpeng Li Cc: Thomas Gleixner , Mike Galbraith , Rik van Riel , Luiz Capitulino , "linux-kernel@vger.kernel.org" , Peter Zijlstra Subject: Re: [BUG nohz]: wrong user and system time accounting Message-ID: <20170413133255.GA2608@lerouge> References: <1490636129.8850.76.camel@redhat.com> <20170328132406.7d23579c@redhat.com> <20170329131656.1d6cb743@redhat.com> <1490818125.28917.11.camel@redhat.com> <1490848051.4167.57.camel@gmx.de> <20170412131818.GB21309@lerouge> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3987 Lines: 112 On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote: > 2017-04-12 22:57 GMT+08:00 Thomas Gleixner : > > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: > >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: > >> > It's not different from the current jiffies based stuff at all. Same > >> > failure mode. > >> > >> Yes you're right, I got confused again. So to fix this we could do our snapshots > >> at a frequency lower than HZ but still high enough to avoid overhead. > >> > >> Something like TICK_NSEC / 2 ? > > > > If you are using TSC anyway then you can do proper accumulation for both > > system and user and only account the data when the accumulation is more > > than a jiffie. > > So I implement it as below: > > - HZ=1000. > 1) two cpu hogs on cpu in nohz_full mode, 100% user time > 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected) > - HZ=250 > 1) two cpu hogs on cpu in nohz_full mode, 100% user time > 2) Luzi's testcase, 100% idle > > So the codes below still not work correctly for HZ=250, any suggestions? Right, so first lets reorder that code a bit so we can see clear inside :-) > > -------------------------------------->8----------------------------------------------------- > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d67eee8..6a11771 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -668,6 +668,8 @@ struct task_struct { > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > seqcount_t vtime_seqcount; > unsigned long long vtime_snap; > + u64 vtime_acct_utime; > + u64 vtime_acct_stime; You need to accumulate guest and steal time as well. > enum { > /* Task is sleeping or running in a CPU with VTIME inactive: */ > VTIME_INACTIVE = 0, > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..f8e54ba 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ > > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > -static u64 vtime_delta(struct task_struct *tsk) > +static u64 vtime_delta(struct task_struct *tsk, bool user) > { > - unsigned long now = READ_ONCE(jiffies); > + u64 delta, ret = 0; > > - if (time_before(now, (unsigned long)tsk->vtime_snap)) > - return 0; > + delta = sched_clock() - tsk->vtime_snap; > > - return jiffies_to_nsecs(now - tsk->vtime_snap); > + if (is_idle_task(tsk)) { > + if (delta >= TICK_NSEC) > + ret = delta; > + } else { > + if (user) { > + tsk->vtime_acct_utime += delta; > + if (tsk->vtime_acct_utime >= TICK_NSEC) > + ret = tsk->vtime_acct_utime; > + } else { > + tsk->vtime_acct_stime += delta; > + if (tsk->vtime_acct_utime >= TICK_NSEC) > + ret = tsk->vtime_acct_stime; > + } We already have vtime_account_idle, vtime_account_user, etc... The accumulation should be done by these functions that know what and where to account. vtime_delta() should really just return the difference against vtime_snap, it's too low level to care about these details. > + } > + > + return ret; > } > > -static u64 get_vtime_delta(struct task_struct *tsk) > +static u64 get_vtime_delta(struct task_struct *tsk, bool user) > { > - unsigned long now = READ_ONCE(jiffies); > - u64 delta, other; > + u64 delta = vtime_delta(tsk, user); > + u64 other; > + > + if (!is_idle_task(tsk)) { > + if (user) > + tsk->vtime_acct_utime = 0; > + else > + tsk->vtime_acct_stime = 0; > + } Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators. Reset and accounting really should be done by the upper level functions vtime_account_*() Thanks.