Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754349AbcKUQUM (ORCPT ); Mon, 21 Nov 2016 11:20:12 -0500 Received: from mail-wm0-f68.google.com ([74.125.82.68]:35893 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752505AbcKUQUL (ORCPT ); Mon, 21 Nov 2016 11:20:11 -0500 Date: Mon, 21 Nov 2016 17:20:06 +0100 From: Frederic Weisbecker To: Martin Schwidefsky Cc: LKML , Tony Luck , Wanpeng Li , Peter Zijlstra , Michael Ellerman , Heiko Carstens , Benjamin Herrenschmidt , Thomas Gleixner , Paul Mackerras , Ingo Molnar , Fenghua Yu , Rik van Riel , Stanislaw Gruszka Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs Message-ID: <20161121162003.GB7554@lerouge> References: <1479406123-24785-1-git-send-email-fweisbec@gmail.com> <20161118130846.7da515cc@mschwide> <20161118144700.GA31560@lerouge> <20161121075956.2b36b3e3@mschwide> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161121075956.2b36b3e3@mschwide> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6104 Lines: 127 On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote: > On Fri, 18 Nov 2016 15:47:02 +0100 > Frederic Weisbecker wrote: > > Just because some code isn't too complex doesn't mean we really want to keep it. > > I get regular questions about what unit does cputime_t map to on a given > > configuration. Everybody gets confused about that. On many of the > > patches we got on cputime for the last years, I had to fix quite some issues > > with bad granularity assumption. In fact most fixes that came to kernel/sched/cputime.c > > recently, after merge or review, were about people getting confused with cputime_t granularity. > > These regular question you get about the cputime_t is exactly what I was referring > to. If the value would just be a u64 the guys asking the question about cputime_t > would just assume the value to be nano-seconds and then go ahead and break things. Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But changing it to nsecs and expect people to deduce it from the u64 type sounds a good direction. > > > Especially for stats that come from nsecs clocks (steal and irqtime), we always have to maintain an > > accumulator and make sure we don't lose some nanosec deltas. > > Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case. Right. > > > And we have to maintain several workarounds, sometimes even in the fastpath in > > order to cope with the cputime_t random granularity all over. > > > > Some fastpath examples: > > > > * steal time accounting (need to convert nsecs to cputime then back) > > * irqtime accounting (maintain accumulators) > > * cputime_adjust, used on any user read of cputime (need to convert from nsecs > > to cputime on cputime_adjust) > > > > But the worst really is about maintainance. This patchset removes around 600 lines. > > Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from > the generic cputime_jiffies.h and cputime_nsecs.h. Well, still worth it :-) > > > The do_account_vtime function is called once per jiffy and once per task > > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy > > > would not be so bad, but the call on the scheduling path *will* hurt. > > > > I don't think we need to flush on task switch. If we maintain the accumulators > > on the task/thread struct instead of per-cpu, then the remaining time after > > task switch out will be accounted on next tick after after next task switch in. > > You can not properly calculate steal time if you allow sleeping tasks to sit on > up to 5*HZ worth of cpu time. Ah, you mean that when the task goes to sleep, we shouldn't miss more than one tick worth of system/user time but the steal time can be much higher, right? > I think we *have* to do accounting on task switch. > At least on s390, likely on powerpc as well. Why not make that an option for > the architecture with the yet-to-be-written accumulating code. Ok, how about doing the accumulation and always account on task switch for now, we'll see later if it's worth having such an option. > > > > What is even worse is the vtime_account_irq_enter path, that is call several > > > times for each *interrupt*, at least two times for an interrupt without > > > additional processing and four times if a softirq is triggered. > > > > Actually maintaining an accumulator to flush on ticks is probably going to increase > > the perf because of that. account_system_time() is called twice per interrupt, and > > such function do much more than just account the time to the task_struct and cpustat > > fields. The same applies to userspace boundaries and context switch. The account_*_time() > > functions can be expensive. > > The account_system_time twice per interrupt can be removed with the accumulation > idea. We will have to see how expensive the accounting_xxx_time calls are on > the context switch path. Right. > > > > > > > Now it has been proposed to implement lazy accounting to accumulate deltas > > > and do the expensive conversions only infrequently. This is pretty straight- > > > forward for account_user_time but to do this for the account_system_time > > > function is more complicated. The function has to differentiate between > > > guest/hardirq/softirq and pure system time. We would need to keep sums for > > > each bucket and provide a separate function to add to each bucket. Like > > > account_guest_time(), account_hardirq_time(), account_softirq_time() and > > > account_system_time(). Then it is up to the arch code to sort out the details > > > and call the accounting code once per jiffy for each of the buckets. > > > > That wouldn't be too hard really. The s390 code in vtime.c already does that. > > Yes, I agree that the accumulating change would not be too hard. Can I make the > request that we try to get that done first before doing the cleanup ? Of course. I see you started something, I'll be glad to help! > > > > We still have to do the whole thing on each task switch though. > > > > Not if we maintain the deltas in the task_struct. > > > > > > > > But I am still not happy about the approach. What is the compelling reason > > > for this change except for the "but it looks ugly"? > > > > The diffstat (600 lines removed). Also the fact that we have all these workarounds > > in the core code just for the special case of 1 arch (s390) and a half > > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE). > > > > I'd much rather have all that complexity moved in a vtime_native.c shared by s390 and powerpc > > that takes care of proper accumulation in cputime_t and flushes that on ticks in nsecs rather > > than having all these cputime_t game all over the kernel. > > The goal to have nano-seconds only in the core code is a good one. And with the > accumulator I think s390 can live with it. The change would have a real upside > too. There are these stupid divisions for scaled cputime that we have to calculate > for every call to account_xxx_time(). These would not be done for the interrupts > anymore. Exactly! Thanks.