Date: Tue, 22 Nov 2016 14:45:56 +0100
From: Frederic Weisbecker
To: Martin Schwidefsky
Cc: LKML, Tony Luck, Wanpeng Li, Peter Zijlstra, Michael Ellerman,
	Heiko Carstens, Benjamin Herrenschmidt, Thomas Gleixner,
	Paul Mackerras, Ingo Molnar, Fenghua Yu, Rik van Riel,
	Stanislaw Gruszka
Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs
Message-ID: <20161122134550.GA21436@lerouge>
References: <1479406123-24785-1-git-send-email-fweisbec@gmail.com>
	<20161118130846.7da515cc@mschwide>
	<20161118144700.GA31560@lerouge>
	<20161121075956.2b36b3e3@mschwide>
	<20161121111728.13a0a3db@mschwide>
In-Reply-To: <20161121111728.13a0a3db@mschwide>

On Mon, Nov 21, 2016 at 11:17:28AM +0100, Martin Schwidefsky wrote:
> On Mon, 21 Nov 2016 07:59:56 +0100
> Martin Schwidefsky wrote:
> 
> > On Fri, 18 Nov 2016 15:47:02 +0100
> > Frederic Weisbecker wrote:
> > 
> > > On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote:
> > > > On Thu, 17 Nov 2016 19:08:07 +0100
> > > > Frederic Weisbecker wrote:
> > > > 
> > > > Now it has been proposed to implement lazy accounting to accumulate
> > > > deltas and do the expensive conversions only infrequently. This is
> > > > pretty straightforward for account_user_time, but to do this for the
> > > > account_system_time function is more complicated. The function has
> > > > to differentiate between guest/hardirq/softirq and pure system time.
> > > > We would need to keep sums for each bucket and provide a separate
> > > > function to add to each bucket, like account_guest_time(),
> > > > account_hardirq_time(), account_softirq_time() and
> > > > account_system_time(). Then it is up to the arch code to sort out
> > > > the details and call the accounting code once per jiffy for each
> > > > of the buckets.
> > > 
> > > That wouldn't be too hard really. The s390 code in vtime.c already
> > > does that.
> > 
> > Yes, I agree that the accumulating change would not be too hard. Can I
> > make the request that we try to get that done first before doing the
> > cleanup?
> 
> Played with the idea a bit, here is a prototype patch to do the delayed
> system time accounting for s390. It applies against the latest s390
> features tree, which you'll find here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
> 
> The details probably need some more work, but it works.
> 
> --
> From 1b5ef9ddf899da81a48de826f783b15e6fc45d25 Mon Sep 17 00:00:00 2001
> From: Martin Schwidefsky
> Date: Mon, 21 Nov 2016 10:44:10 +0100
> Subject: [PATCH] s390/cputime: delayed accounting of system time
> 
> The account_system_time() function is called with a cputime that
> occurred while running in the kernel. The function detects which
> context the CPU is currently running in and accounts the time to
> the correct bucket. This forces the arch code to account the
> cputime for hardirq and softirq immediately.
> 
> Make account_guest_time non-static and add account_sys_time,
> account_hardirq_time and account_softirq_time. With these functions
> the arch code can delay the accounting for system time. For s390
> the accounting is done once per timer tick and for each task switch.
> 
> Signed-off-by: Martin Schwidefsky

Thanks a lot for taking care of that! I'll give it a try and do the same
on powerpc. A few comments below:

> ---
>  arch/s390/include/asm/lowcore.h   |  65 ++++++++++++-----------
>  arch/s390/include/asm/processor.h |   3 ++
>  arch/s390/kernel/vtime.c          | 106 ++++++++++++++++++++++----------------
>  include/linux/kernel_stat.h       |  13 +++--
>  kernel/sched/cputime.c            |  22 +++++++-
>  5 files changed, 129 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
> index 62a5cf1..8a5b082 100644
> --- a/arch/s390/include/asm/lowcore.h
> +++ b/arch/s390/include/asm/lowcore.h

[...]

> @@ -110,34 +119,48 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
>  #endif
>  		: "=m" (S390_lowcore.last_update_timer),
>  		  "=m" (S390_lowcore.last_update_clock));
> -	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> -	S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> +	clock = S390_lowcore.last_update_clock - clock;
> +	timer -= S390_lowcore.last_update_timer;
> +
> +	if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> +		S390_lowcore.guest_timer += timer;
> +	else if (hardirq_count() - hardirq_offset)
> +		S390_lowcore.hardirq_timer += timer;
> +	else if (in_serving_softirq())
> +		S390_lowcore.softirq_timer += timer;
> +	else
> +		S390_lowcore.system_timer += timer;

I initially thought that some of this accumulation code could be shared;
now I'm not sure whether that would be a good idea. One example would be
handling the contexts above in order to store the accumulated time in the
appropriate bucket, as in the sketch below.
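Purely as a minimal sketch, assuming a per-CPU accumulator struct; struct
vtime_accu and vtime_accu_add_kernel() are names I made up for
illustration, only the context tests mirror your code above:

/* One sum per bucket; this would live in the lowcore / per-CPU area. */
struct vtime_accu {
	u64 user;
	u64 guest;
	u64 system;
	u64 hardirq;
	u64 softirq;
};

/* Pick the bucket matching the current context and add the delta. */
static void vtime_accu_add_kernel(struct vtime_accu *accu,
				  struct task_struct *tsk,
				  u64 delta, int hardirq_offset)
{
	if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
		accu->guest += delta;
	else if (hardirq_count() - hardirq_offset)
		accu->hardirq += delta;
	else if (in_serving_softirq())
		accu->softirq += delta;
	else
		accu->system += delta;
}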
> 
>  	/* Update MT utilization calculation */
>  	if (smp_cpu_mtid &&
>  	    time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
>  		update_mt_scaling();
>  
> +	/* Calculate cputime delta */
>  	user = S390_lowcore.user_timer - tsk->thread.user_timer;
> -	S390_lowcore.steal_timer -= user;
>  	tsk->thread.user_timer = S390_lowcore.user_timer;
> -
> +	guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
> +	tsk->thread.guest_timer = S390_lowcore.guest_timer;
>  	system = S390_lowcore.system_timer - tsk->thread.system_timer;
> -	S390_lowcore.steal_timer -= system;
>  	tsk->thread.system_timer = S390_lowcore.system_timer;
> -
> -	user_scaled = user;
> -	system_scaled = system;
> -	/* Do MT utilization scaling */
> -	if (smp_cpu_mtid) {
> -		u64 mult = __this_cpu_read(mt_scaling_mult);
> -		u64 div = __this_cpu_read(mt_scaling_div);
> -
> -		user_scaled = (user_scaled * mult) / div;
> -		system_scaled = (system_scaled * mult) / div;
> -	}
> -	account_user_time(tsk, user, user_scaled);
> -	account_system_time(tsk, hardirq_offset, system, system_scaled);
> +	hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> +	tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> +	softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> +	tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> +	S390_lowcore.steal_timer +=
> +		clock - user - guest - system - hardirq - softirq;
> +
> +	/* Push account value */
> +	if (user)
> +		account_user_time(tsk, user, scale_vtime(user));
> +	if (guest)
> +		account_guest_time(tsk, guest, scale_vtime(guest));
> +	if (system)
> +		account_sys_time(tsk, system, scale_vtime(system));
> +	if (hardirq)
> +		account_hardirq_time(tsk, hardirq, scale_vtime(hardirq));
> +	if (softirq)
> +		account_softirq_time(tsk, softirq, scale_vtime(softirq));

And doing that "push" part would be another piece of the shared code; see
the sketch after the next hunk.

> 
>  	steal = S390_lowcore.steal_timer;
>  	if ((s64) steal > 0) {
> @@ -145,16 +168,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
>  		account_steal_time(steal);
>  	}
>  
> -	return virt_timer_forward(user + system);
> +	return virt_timer_forward(user + guest + system + hardirq + softirq);
>  }
>  
>  void vtime_task_switch(struct task_struct *prev)
>  {
>  	do_account_vtime(prev, 0);
>  	prev->thread.user_timer = S390_lowcore.user_timer;
> +	prev->thread.guest_timer = S390_lowcore.guest_timer;
>  	prev->thread.system_timer = S390_lowcore.system_timer;
> +	prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> +	prev->thread.softirq_timer = S390_lowcore.softirq_timer;
>  	S390_lowcore.user_timer = current->thread.user_timer;
> +	S390_lowcore.guest_timer = current->thread.guest_timer;
>  	S390_lowcore.system_timer = current->thread.system_timer;
> +	S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> +	S390_lowcore.softirq_timer = current->thread.softirq_timer;
>  }

Ditto for the snapshot save/restore here.
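Here is the flush sketch I mean, again with vtime_accu_flush() being a
made-up name; scale_vtime() comes from your patch and stands in for
whatever per-arch scaling hook applies:

/* Push every non-zero bucket to the generic accounting functions. */
static void vtime_accu_flush(struct task_struct *tsk,
			     struct vtime_accu *accu)
{
	if (accu->user)
		account_user_time(tsk, accu->user, scale_vtime(accu->user));
	if (accu->guest)
		account_guest_time(tsk, accu->guest, scale_vtime(accu->guest));
	if (accu->system)
		account_sys_time(tsk, accu->system, scale_vtime(accu->system));
	if (accu->hardirq)
		account_hardirq_time(tsk, accu->hardirq,
				     scale_vtime(accu->hardirq));
	if (accu->softirq)
		account_softirq_time(tsk, accu->softirq,
				     scale_vtime(accu->softirq));
}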
> 
>  /*
> @@ -174,31 +203,22 @@ void vtime_account_user(struct task_struct *tsk)
>   */
>  void vtime_account_irq_enter(struct task_struct *tsk)
>  {
> -	u64 timer, system, system_scaled;
> +	u64 timer;
>  
>  	timer = S390_lowcore.last_update_timer;
>  	S390_lowcore.last_update_timer = get_vtimer();
> -	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> -
> -	/* Update MT utilization calculation */
> -	if (smp_cpu_mtid &&
> -	    time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> -		update_mt_scaling();
> -
> -	system = S390_lowcore.system_timer - tsk->thread.system_timer;
> -	S390_lowcore.steal_timer -= system;
> -	tsk->thread.system_timer = S390_lowcore.system_timer;
> -	system_scaled = system;
> -	/* Do MT utilization scaling */
> -	if (smp_cpu_mtid) {
> -		u64 mult = __this_cpu_read(mt_scaling_mult);
> -		u64 div = __this_cpu_read(mt_scaling_div);
> -
> -		system_scaled = (system_scaled * mult) / div;
> -	}
> -	account_system_time(tsk, 0, system, system_scaled);
> -
> -	virt_timer_forward(system);
> +	timer -= S390_lowcore.last_update_timer;
> +
> +	if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
> +		S390_lowcore.guest_timer += timer;
> +	else if (hardirq_count())
> +		S390_lowcore.hardirq_timer += timer;
> +	else if (in_serving_softirq())
> +		S390_lowcore.softirq_timer += timer;
> +	else
> +		S390_lowcore.system_timer += timer;

And ditto. We could put the accumulation together in a common struct in
s390_lowcore, with its mirror in the thread struct, and then have helpers
take care of the contexts.

How does that sound to you? Would it help or hurt?

Thanks.
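P.S. To make the common-struct idea a bit more concrete, here is roughly
what I have in mind, with all names again invented for illustration: the
lowcore and thread_struct would both embed the accumulator type from the
sketch above, and a transfer helper would compute the per-bucket deltas
and resync the per-task snapshot:

/* struct vtime_accu would be embedded both in the per-CPU area (the
 * lowcore on s390) and in thread_struct as the per-task snapshot. */
static void vtime_accu_transfer(struct vtime_accu *cpu,
				struct vtime_accu *task,
				struct vtime_accu *delta)
{
	delta->user    = cpu->user    - task->user;
	delta->guest   = cpu->guest   - task->guest;
	delta->system  = cpu->system  - task->system;
	delta->hardirq = cpu->hardirq - task->hardirq;
	delta->softirq = cpu->softirq - task->softirq;
	*task = *cpu;	/* resync the snapshot */
}

do_account_vtime() would then boil down to vtime_accu_transfer() followed
by vtime_accu_flush(), and vtime_task_switch() would only swap the
snapshots.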