Date: Tue, 22 Nov 2016 15:28:42 +0100
From: Martin Schwidefsky
To: Frederic Weisbecker
Cc: LKML, Tony Luck, Wanpeng Li, Peter Zijlstra, Michael Ellerman,
    Heiko Carstens, Benjamin Herrenschmidt, Thomas Gleixner, Paul Mackerras,
    Ingo Molnar, Fenghua Yu, Rik van Riel, Stanislaw Gruszka
Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs
In-Reply-To: <20161122134550.GA21436@lerouge>
References: <1479406123-24785-1-git-send-email-fweisbec@gmail.com>
    <20161118130846.7da515cc@mschwide> <20161118144700.GA31560@lerouge>
    <20161121075956.2b36b3e3@mschwide> <20161121111728.13a0a3db@mschwide>
    <20161122134550.GA21436@lerouge>
Message-Id: <20161122152842.28fc95c0@mschwide>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 22 Nov 2016 14:45:56 +0100 Frederic Weisbecker
wrote:

> On Mon, Nov 21, 2016 at 11:17:28AM +0100, Martin Schwidefsky wrote:
> > On Mon, 21 Nov 2016 07:59:56 +0100
> > Martin Schwidefsky wrote:
> [...]
> > @@ -110,34 +119,48 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> >  #endif
> >  		: "=m" (S390_lowcore.last_update_timer),
> >  		  "=m" (S390_lowcore.last_update_clock));
> > -	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> > -	S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> > +	clock = S390_lowcore.last_update_clock - clock;
> > +	timer -= S390_lowcore.last_update_timer;
> > +
> > +	if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> > +		S390_lowcore.guest_timer += timer;
> > +	else if (hardirq_count() - hardirq_offset)
> > +		S390_lowcore.hardirq_timer += timer;
> > +	else if (in_serving_softirq())
> > +		S390_lowcore.softirq_timer += timer;
> > +	else
> > +		S390_lowcore.system_timer += timer;
>
> I initially thought that some code could be shared for that whole accumulation. Now I
> don't know if it would be a good idea. An example would be to deal with the contexts above
> in order to store the accumulation to the appropriate place.

I thought about a common code inline function that returns the index
(CPUTIME_SYSTEM, CPUTIME_IRQ, ..) for the current context. It did not
look too appealing anymore after I typed it down.
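
For illustration only, the kind of helper described above might look roughly like the following user-space sketch. It is not code from this thread: the names sketch_ctx, sketch_index and sketch_cputime_index are hypothetical, and the kernel predicates (tsk->flags & PF_VCPU, irq_count(), hardirq_count(), in_serving_softirq()) are modelled as plain struct fields so the decision tree can be compiled and tested outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cpustat indices named in the mail. */
enum sketch_index {
	SKETCH_CPUTIME_GUEST,
	SKETCH_CPUTIME_SYSTEM,
	SKETCH_CPUTIME_IRQ,
	SKETCH_CPUTIME_SOFTIRQ,
};

/* Assumption: a snapshot of the context, replacing the kernel predicates. */
struct sketch_ctx {
	bool is_vcpu;           /* models (tsk->flags & PF_VCPU) */
	unsigned irq_count;     /* models irq_count() */
	unsigned hardirq_count; /* models hardirq_count() */
	bool serving_softirq;   /* models in_serving_softirq() */
};

/* Same order of checks as the quoted hunk: guest, hardirq, softirq, system. */
enum sketch_index sketch_cputime_index(const struct sketch_ctx *c)
{
	if (c->is_vcpu && c->irq_count == 0)
		return SKETCH_CPUTIME_GUEST;
	if (c->hardirq_count)
		return SKETCH_CPUTIME_IRQ;
	if (c->serving_softirq)
		return SKETCH_CPUTIME_SOFTIRQ;
	return SKETCH_CPUTIME_SYSTEM;
}
```

A caller would still have to map the returned index to an arch-specific accumulator, which is where the arch code comes back in.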
> > -	account_user_time(tsk, user, user_scaled);
> > -	account_system_time(tsk, hardirq_offset, system, system_scaled);
> > +	hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> > +	tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> > +	softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> > +	tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> > +	S390_lowcore.steal_timer +=
> > +		clock - user - guest - system - hardirq - softirq;
> > +
> > +	/* Push account value */
> > +	if (user)
> > +		account_user_time(tsk, user, scale_vtime(user));
> > +	if (guest)
> > +		account_guest_time(tsk, guest, scale_vtime(guest));
> > +	if (system)
> > +		account_sys_time(tsk, system, scale_vtime(system));
> > +	if (hardirq)
> > +		account_hardirq_time(tsk, hardirq, scale_vtime(hardirq));
> > +	if (softirq)
> > +		account_softirq_time(tsk, softirq, scale_vtime(softirq));
>
> And doing that would be another part of the shared code.

Right now I would feel more comfortable if that stays architecture code.
The calculation up to the point where the account_xxx_time functions can
be called is definitely arch specific. Why try to do the accumulation in
common code? I have the feeling that would just complicate the code for
no good reason.
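
As a worked illustration of the steal-time arithmetic in the quoted hunk, here is a small user-space sketch. The names sketch_deltas and sketch_steal_delta are hypothetical; the only thing taken from the patch is the formula, i.e. the wall-clock delta minus all accounted cputime deltas, interpreted as a signed value the way the "(s64) steal > 0" check does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the per-bucket deltas of one accounting pass. */
struct sketch_deltas {
	uint64_t user, guest, system, hardirq, softirq;
};

/*
 * Steal delta = elapsed wall-clock time minus the CPU time that was
 * actually accounted. The subtraction is done unsigned and then viewed
 * as signed, so a small negative result (more CPU time accounted than
 * clock time elapsed) shows up as a negative number. The conversion
 * relies on the usual two's complement behaviour.
 */
int64_t sketch_steal_delta(uint64_t clock_delta, const struct sketch_deltas *d)
{
	uint64_t used = d->user + d->guest + d->system +
			d->hardirq + d->softirq;

	return (int64_t)(clock_delta - used);
}
```

With a clock delta of 1000 units and 600 units accounted across the buckets, the sketch reports 400 units of steal; only a positive accumulated value would be pushed out via account_steal_time().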
> >  
> >  	steal = S390_lowcore.steal_timer;
> >  	if ((s64) steal > 0) {
> > @@ -145,16 +168,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> >  		account_steal_time(steal);
> >  	}
> >  
> > -	return virt_timer_forward(user + system);
> > +	return virt_timer_forward(user + guest + system + hardirq + softirq);
> >  }
> >  
> >  void vtime_task_switch(struct task_struct *prev)
> >  {
> >  	do_account_vtime(prev, 0);
> >  	prev->thread.user_timer = S390_lowcore.user_timer;
> > +	prev->thread.guest_timer = S390_lowcore.guest_timer;
> >  	prev->thread.system_timer = S390_lowcore.system_timer;
> > +	prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> > +	prev->thread.softirq_timer = S390_lowcore.softirq_timer;
> >  	S390_lowcore.user_timer = current->thread.user_timer;
> > +	S390_lowcore.guest_timer = current->thread.guest_timer;
> >  	S390_lowcore.system_timer = current->thread.system_timer;
> > +	S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> > +	S390_lowcore.softirq_timer = current->thread.softirq_timer;
> >  }
>
> Ditto.

Same here. The lowcore fields are too arch specific.
> >  
> >  /*
> > @@ -174,31 +203,22 @@ void vtime_account_user(struct task_struct *tsk)
> >   */
> >  void vtime_account_irq_enter(struct task_struct *tsk)
> >  {
> > -	u64 timer, system, system_scaled;
> > +	u64 timer;
> >  
> >  	timer = S390_lowcore.last_update_timer;
> >  	S390_lowcore.last_update_timer = get_vtimer();
> > -	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> > -
> > -	/* Update MT utilization calculation */
> > -	if (smp_cpu_mtid &&
> > -	    time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> > -		update_mt_scaling();
> > -
> > -	system = S390_lowcore.system_timer - tsk->thread.system_timer;
> > -	S390_lowcore.steal_timer -= system;
> > -	tsk->thread.system_timer = S390_lowcore.system_timer;
> > -	system_scaled = system;
> > -	/* Do MT utilization scaling */
> > -	if (smp_cpu_mtid) {
> > -		u64 mult = __this_cpu_read(mt_scaling_mult);
> > -		u64 div = __this_cpu_read(mt_scaling_div);
> > -
> > -		system_scaled = (system_scaled * mult) / div;
> > -	}
> > -	account_system_time(tsk, 0, system, system_scaled);
> > -
> > -	virt_timer_forward(system);
> > +	timer -= S390_lowcore.last_update_timer;
> > +
> > +	if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
> > +		S390_lowcore.guest_timer += timer;
> > +	else if (hardirq_count())
> > +		S390_lowcore.hardirq_timer += timer;
> > +	else if (in_serving_softirq())
> > +		S390_lowcore.softirq_timer += timer;
> > +	else
> > +		S390_lowcore.system_timer += timer;
>
> And Ditto.

It would be nice if we could find a way to move the decision tree for
where to put the cputime delta into common code.

> We could put together the accumulation in a common struct in s390_lowcore,
> and its mirror in thread struct then have helpers take care of the contexts.
>
> How does that sound to you, would it help or hurt?

My gut feeling is that trying to make the accumulation code common will
hurt more than it helps. But we can certainly try and look at the result.
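
To make the "common struct plus mirror" idea quoted above a bit more concrete, here is a minimal user-space sketch under stated assumptions. Nothing in it is from the thread's code: sketch_vtime_acc, sketch_acc_add and sketch_acc_flush are hypothetical names, one instance of the struct stands in for the per-CPU (lowcore) side and a second one for the per-task mirror, and the flush step does for every bucket what the delta computation in do_account_vtime() does field by field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bucket indices for the accumulators. */
enum {
	SK_USER, SK_GUEST, SK_SYSTEM, SK_HARDIRQ, SK_SOFTIRQ, SK_NR
};

/* One instance per CPU, and one mirror per task (assumption). */
struct sketch_vtime_acc {
	uint64_t t[SK_NR];
};

/* Cheap path: add elapsed time to the bucket for the current context. */
void sketch_acc_add(struct sketch_vtime_acc *cpu, int bucket, uint64_t elapsed)
{
	cpu->t[bucket] += elapsed;
}

/*
 * Delayed path: compute the per-bucket deltas since the task mirror was
 * last synced, update the mirror, and return the sum of all deltas.
 */
uint64_t sketch_acc_flush(const struct sketch_vtime_acc *cpu,
			  struct sketch_vtime_acc *task,
			  uint64_t delta[SK_NR])
{
	uint64_t total = 0;

	for (int i = 0; i < SK_NR; i++) {
		delta[i] = cpu->t[i] - task->t[i];
		task->t[i] = cpu->t[i];
		total += delta[i];
	}
	return total;
}
```

Whether such a struct pays off depends on how much of the surrounding code (timer reads, steal handling, scaling) could really be shared, which is exactly the open question in this exchange.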

I spent some more time on this; here is my current patch. For my part the
patch is close to the final solution if we can agree on it.
--
From a8f5d41df5f32897335567ea9f5a61a716855d5d Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky
Date: Mon, 21 Nov 2016 10:44:10 +0100
Subject: [PATCH] s390/cputime: delayed accounting of system time

The account_system_time() function is called with a cputime that
occurred while running in the kernel. The function detects the
current context of the CPU (system, guest, irq, or softirq) and
accounts the time to the correct bucket. This forces the arch code
to account the cputime for hardirq and softirq before entering and
after leaving the context in question.

Make account_guest_time non-static and add account_system_time_native.
With these two functions the arch code can delay the accounting for
system time. For s390 the accounting is done once per timer tick and
for each task switch.

Signed-off-by: Martin Schwidefsky
---
 arch/s390/include/asm/lowcore.h   |  65 ++++++++++----------
 arch/s390/include/asm/processor.h |   3 +
 arch/s390/kernel/vtime.c          | 126 +++++++++++++++++++++++---------------
 include/linux/kernel_stat.h       |   4 ++
 kernel/sched/cputime.c            |  12 +++-
 5 files changed, 127 insertions(+), 83 deletions(-)

diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 62a5cf1..8a5b082 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -85,53 +85,56 @@ struct lowcore {
 	__u64	mcck_enter_timer;		/* 0x02c0 */
 	__u64	exit_timer;			/* 0x02c8 */
 	__u64	user_timer;			/* 0x02d0 */
-	__u64	system_timer;			/* 0x02d8 */
-	__u64	steal_timer;			/* 0x02e0 */
-	__u64	last_update_timer;		/* 0x02e8 */
-	__u64	last_update_clock;		/* 0x02f0 */
-	__u64	int_clock;			/* 0x02f8 */
-	__u64	mcck_clock;			/* 0x0300 */
-	__u64	clock_comparator;		/* 0x0308 */
+	__u64	guest_timer;			/* 0x02d8 */
+	__u64	system_timer;			/* 0x02e0 */
+	__u64	hardirq_timer;			/* 0x02e8 */
+	__u64	softirq_timer;			/* 0x02f0 */
+	__u64	steal_timer;			/* 0x02f8 */
+	__u64	last_update_timer;		/* 0x0300 */
+	__u64	last_update_clock;		/* 0x0308 */
+	__u64	int_clock;			/* 0x0310 */
+	__u64	mcck_clock;			/* 0x0318 */
+	__u64	clock_comparator;		/* 0x0320 */
 
 	/* Current process. */
-	__u64	current_task;			/* 0x0310 */
-	__u8	pad_0x318[0x320-0x318];		/* 0x0318 */
-	__u64	kernel_stack;			/* 0x0320 */
+	__u64	current_task;			/* 0x0328 */
+	__u8	pad_0x318[0x320-0x318];		/* 0x0330 */
+	__u64	kernel_stack;			/* 0x0338 */
 
 	/* Interrupt, panic and restart stack. */
-	__u64	async_stack;			/* 0x0328 */
-	__u64	panic_stack;			/* 0x0330 */
-	__u64	restart_stack;			/* 0x0338 */
+	__u64	async_stack;			/* 0x0340 */
+	__u64	panic_stack;			/* 0x0348 */
+	__u64	restart_stack;			/* 0x0350 */
 
 	/* Restart function and parameter. */
-	__u64	restart_fn;			/* 0x0340 */
-	__u64	restart_data;			/* 0x0348 */
-	__u64	restart_source;			/* 0x0350 */
+	__u64	restart_fn;			/* 0x0358 */
+	__u64	restart_data;			/* 0x0360 */
+	__u64	restart_source;			/* 0x0368 */
 
 	/* Address space pointer. */
-	__u64	kernel_asce;			/* 0x0358 */
-	__u64	user_asce;			/* 0x0360 */
+	__u64	kernel_asce;			/* 0x0370 */
+	__u64	user_asce;			/* 0x0378 */
 
 	/*
 	 * The lpp and current_pid fields form a
 	 * 64-bit value that is set as program
 	 * parameter with the LPP instruction.
 	 */
-	__u32	lpp;				/* 0x0368 */
-	__u32	current_pid;			/* 0x036c */
+	__u32	lpp;				/* 0x0380 */
+	__u32	current_pid;			/* 0x0384 */
 
 	/* SMP info area */
-	__u32	cpu_nr;				/* 0x0370 */
-	__u32	softirq_pending;		/* 0x0374 */
-	__u64	percpu_offset;			/* 0x0378 */
-	__u64	vdso_per_cpu_data;		/* 0x0380 */
-	__u64	machine_flags;			/* 0x0388 */
-	__u32	preempt_count;			/* 0x0390 */
-	__u8	pad_0x0394[0x0398-0x0394];	/* 0x0394 */
-	__u64	gmap;				/* 0x0398 */
-	__u32	spinlock_lockval;		/* 0x03a0 */
-	__u32	fpu_flags;			/* 0x03a4 */
-	__u8	pad_0x03a8[0x0400-0x03a8];	/* 0x03a8 */
+	__u32	cpu_nr;				/* 0x0388 */
+	__u32	softirq_pending;		/* 0x038c */
+	__u64	percpu_offset;			/* 0x0390 */
+	__u64	vdso_per_cpu_data;		/* 0x0398 */
+	__u64	machine_flags;			/* 0x03a0 */
+	__u32	preempt_count;			/* 0x03a8 */
+	__u8	pad_0x03ac[0x03b0-0x03ac];	/* 0x03ac */
+	__u64	gmap;				/* 0x03b0 */
+	__u32	spinlock_lockval;		/* 0x03b8 */
+	__u32	fpu_flags;			/* 0x03bc */
+	__u8	pad_0x03c0[0x0400-0x03c0];	/* 0x03c0 */
 
 	/* Per cpu primary space access list */
 	__u32	paste[16];			/* 0x0400 */
diff --git a/arch/s390/include/asm/processor.h b/arch/s390/include/asm/processor.h
index bf8b2e2..0234eea 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -111,7 +111,10 @@ struct thread_struct {
 	unsigned int  acrs[NUM_ACRS];
 	unsigned long ksp;		/* kernel stack pointer */
 	unsigned long user_timer;	/* task cputime in user space */
+	unsigned long guest_timer;	/* task cputime in kvm guest */
 	unsigned long system_timer;	/* task cputime in kernel space */
+	unsigned long hardirq_timer;	/* task cputime in hardirq context */
+	unsigned long softirq_timer;	/* task cputime in softirq context */
 	unsigned long sys_call_table;	/* system call table address */
 	mm_segment_t mm_segment;
 	unsigned long gmap_addr;	/* address of last gmap fault. */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 9a6c957..b6de91e 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -90,14 +90,30 @@ static void update_mt_scaling(void)
 	__this_cpu_write(mt_scaling_jiffies, jiffies_64);
 }
 
+static inline u64 update_tsk_timer(unsigned long *tsk_vtime, u64 new)
+{
+	u64 delta;
+
+	delta = new - *tsk_vtime;
+	*tsk_vtime = new;
+	return delta;
+}
+
+static inline u64 scale_vtime(u64 vtime)
+{
+	u64 mult = __this_cpu_read(mt_scaling_mult);
+	u64 div = __this_cpu_read(mt_scaling_div);
+
+	return smp_cpu_mtid ? (vtime * mult / div) : vtime;
+}
+
 /*
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
  */
-static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
+static int do_account_vtime(struct task_struct *tsk)
 {
-	u64 timer, clock, user, system, steal;
-	u64 user_scaled, system_scaled;
+	u64 timer, clock, user, guest, system, hardirq, softirq, steal;
 
 	timer = S390_lowcore.last_update_timer;
 	clock = S390_lowcore.last_update_clock;
@@ -110,34 +126,47 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
 #endif
 		: "=m" (S390_lowcore.last_update_timer),
 		  "=m" (S390_lowcore.last_update_clock));
-	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
-	S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
+	clock = S390_lowcore.last_update_clock - clock;
+	timer -= S390_lowcore.last_update_timer;
+
+	if (hardirq_count())
+		S390_lowcore.hardirq_timer += timer;
+	else
+		S390_lowcore.system_timer += timer;
 
 	/* Update MT utilization calculation */
 	if (smp_cpu_mtid &&
 	    time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
 		update_mt_scaling();
 
-	user = S390_lowcore.user_timer - tsk->thread.user_timer;
-	S390_lowcore.steal_timer -= user;
-	tsk->thread.user_timer = S390_lowcore.user_timer;
-
-	system = S390_lowcore.system_timer - tsk->thread.system_timer;
-	S390_lowcore.steal_timer -= system;
-	tsk->thread.system_timer = S390_lowcore.system_timer;
-
-	user_scaled = user;
-	system_scaled = system;
-	/* Do MT utilization scaling */
-	if (smp_cpu_mtid) {
-		u64 mult = __this_cpu_read(mt_scaling_mult);
-		u64 div = __this_cpu_read(mt_scaling_div);
-
-		user_scaled = (user_scaled * mult) / div;
-		system_scaled = (system_scaled * mult) / div;
-	}
-	account_user_time(tsk, user, user_scaled);
-	account_system_time(tsk, hardirq_offset, system, system_scaled);
+	/* Calculate cputime delta */
+	user = update_tsk_timer(&tsk->thread.user_timer,
+				READ_ONCE(S390_lowcore.user_timer));
+	guest = update_tsk_timer(&tsk->thread.guest_timer,
+				 READ_ONCE(S390_lowcore.guest_timer));
+	system = update_tsk_timer(&tsk->thread.system_timer,
+				  READ_ONCE(S390_lowcore.system_timer));
+	hardirq = update_tsk_timer(&tsk->thread.hardirq_timer,
+				   READ_ONCE(S390_lowcore.hardirq_timer));
+	softirq = update_tsk_timer(&tsk->thread.softirq_timer,
+				   READ_ONCE(S390_lowcore.softirq_timer));
+	S390_lowcore.steal_timer +=
+		clock - user - guest - system - hardirq - softirq;
+
+	/* Push accounting values */
+	if (user)
+		account_user_time(tsk, user, scale_vtime(user));
+	if (guest)
+		account_guest_time(tsk, guest, scale_vtime(guest));
+	if (system)
+		account_system_time_native(tsk, system, scale_vtime(system),
+					   CPUTIME_SYSTEM);
+	if (hardirq)
+		account_system_time_native(tsk, hardirq, scale_vtime(hardirq),
+					   CPUTIME_IRQ);
+	if (softirq)
+		account_system_time_native(tsk, softirq, scale_vtime(softirq),
+					   CPUTIME_SOFTIRQ);
 
 	steal = S390_lowcore.steal_timer;
 	if ((s64) steal > 0) {
@@ -145,16 +174,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
 		account_steal_time(steal);
 	}
 
-	return virt_timer_forward(user + system);
+	return virt_timer_forward(timer);
 }
 
 void vtime_task_switch(struct task_struct *prev)
 {
-	do_account_vtime(prev, 0);
+	do_account_vtime(prev);
 	prev->thread.user_timer = S390_lowcore.user_timer;
+	prev->thread.guest_timer = S390_lowcore.guest_timer;
 	prev->thread.system_timer = S390_lowcore.system_timer;
+	prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
+	prev->thread.softirq_timer = S390_lowcore.softirq_timer;
 	S390_lowcore.user_timer = current->thread.user_timer;
+	S390_lowcore.guest_timer = current->thread.guest_timer;
 	S390_lowcore.system_timer = current->thread.system_timer;
+	S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
+	S390_lowcore.softirq_timer = current->thread.softirq_timer;
 }
 
 /*
@@ -164,7 +199,7 @@ void vtime_task_switch(struct task_struct *prev)
  */
 void vtime_account_user(struct task_struct *tsk)
 {
-	if (do_account_vtime(tsk, HARDIRQ_OFFSET))
+	if (do_account_vtime(tsk))
 		virt_timer_expire();
 }
 
@@ -174,31 +209,22 @@ void vtime_account_user(struct task_struct *tsk)
  */
 void vtime_account_irq_enter(struct task_struct *tsk)
 {
-	u64 timer, system, system_scaled;
+	u64 timer;
 
 	timer = S390_lowcore.last_update_timer;
 	S390_lowcore.last_update_timer = get_vtimer();
-	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
-
-	/* Update MT utilization calculation */
-	if (smp_cpu_mtid &&
-	    time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
-		update_mt_scaling();
-
-	system = S390_lowcore.system_timer - tsk->thread.system_timer;
-	S390_lowcore.steal_timer -= system;
-	tsk->thread.system_timer = S390_lowcore.system_timer;
-	system_scaled = system;
-	/* Do MT utilization scaling */
-	if (smp_cpu_mtid) {
-		u64 mult = __this_cpu_read(mt_scaling_mult);
-		u64 div = __this_cpu_read(mt_scaling_div);
-
-		system_scaled = (system_scaled * mult) / div;
-	}
-	account_system_time(tsk, 0, system, system_scaled);
-
-	virt_timer_forward(system);
+	timer -= S390_lowcore.last_update_timer;
+
+	if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
+		S390_lowcore.guest_timer += timer;
+	else if (hardirq_count())
+		S390_lowcore.hardirq_timer += timer;
+	else if (in_serving_softirq())
+		S390_lowcore.softirq_timer += timer;
+	else
+		S390_lowcore.system_timer += timer;
+
+	virt_timer_forward(timer);
 }
 EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 44fda64..a7e7951 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -80,10 +80,14 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
+extern void account_guest_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void account_system_time_native(struct task_struct *, cputime_t,
+				       cputime_t, int);
+
 static inline void account_process_tick(struct task_struct *tsk, int user)
 {
 	vtime_account_user(tsk);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5ebee31..9e6c5aa 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -155,8 +155,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
  * @cputime: the cpu time spent in virtual machine since the last update
  * @cputime_scaled: cputime scaled by cpu frequency
  */
-static void account_guest_time(struct task_struct *p, cputime_t cputime,
-			       cputime_t cputime_scaled)
+void account_guest_time(struct task_struct *p, cputime_t cputime,
+			cputime_t cputime_scaled)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
@@ -199,6 +199,14 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	acct_account_cputime(p);
 }
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+void account_system_time_native(struct task_struct *p, cputime_t cputime,
+				cputime_t cputime_scaled, int index)
+{
+	__account_system_time(p, cputime, cputime_scaled, index);
+}
+#endif
+
 /*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
-- 
2.8.4

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.