Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp8631922ybi; Tue, 23 Jul 2019 12:06:15 -0700 (PDT) X-Google-Smtp-Source: APXvYqwRxwNjkE7K33gVpr/J69VXH0N83dSX3WQKkMDCUzBLWHwZ3Skq5O1LFLsFBDw6stO+uNL0 X-Received: by 2002:a63:5a4b:: with SMTP id k11mr18632320pgm.143.1563908775361; Tue, 23 Jul 2019 12:06:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1563908775; cv=none; d=google.com; s=arc-20160816; b=pyZsBRKgJU+npHaZJ1ty701w5AgjuUwnL7VoM3LalIPCuKTO04CETk0ny3AP5rU/rp Y4wQm+zdRWSBaAtpcs4oBWHceHBTXXRxzNCtbvPJKq27aGCPs0GAtBPRGlBy0QQdPF1S oZSmzXj89G2+8zGTbeU6aglinQCB5qNn1mPU3hOQMD3rh65CjaIKLcREz8y0O3lJLhVZ SH/3HvQxfGIHcazZR5PaRaKfypRIsHRw+JuQxmMOK3PJSlK/MnZXLZfhQRQV/ZdEZwOd NYvkE/3Siql25NwlZARPsQLfdWATYILXHFpMOlokZfi8we3/TAy/T0UDxGwTA/yzezh/ L86g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=AuVMdkAMTE+BzOKhpX3H0oV59GNL1fBx6wpPcm7MytM=; b=TSOrgSQMAhUhd2zNaE8hE9WEJTChjpnXMCQO5ICe4fC2xa6pkxmn8LFYtBQd/T0AU1 9vwkgtG/VYh9rZ16KHPy2c58GKdX6pBNrAhabZeIOaSJV6emsqajSE/VxSZRAtGSlSQZ MsfHsMz4TITgSLmldRn/n5Bs3hlV9iBYJfAKD4TzXc59vn7tlU/8jGgucTQGIZKhFCqa EArRxRcOqvgz5h+GacO9+yNzaHR11I3lfbTPOc7Owt003QC6ciuZKjyZTkySbHmWGMq+ YoVj3CFGsTKx1fCv4vC/HZMuVCMX+qRHY9wO03/rKeUH8aZL5evY0Vfg9eAnj+iMA3+0 peGA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x19si14196498pgi.515.2019.07.23.12.05.58; Tue, 23 Jul 2019 12:06:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729427AbfGWJiF (ORCPT + 99 others); Tue, 23 Jul 2019 05:38:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59284 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728284AbfGWJiE (ORCPT ); Tue, 23 Jul 2019 05:38:04 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 53D6181F0D; Tue, 23 Jul 2019 09:38:04 +0000 (UTC) Received: from localhost (unknown [10.43.2.114]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9AF065F7C0; Tue, 23 Jul 2019 09:38:01 +0000 (UTC) Date: Tue, 23 Jul 2019 11:37:42 +0200 From: Stanislaw Gruszka To: Peter Zijlstra Cc: Oleg Nesterov , Ingo Molnar , Thomas Gleixner , Andrew Fox , Stephen Johnston , linux-kernel@vger.kernel.org, Linus Torvalds Subject: Re: [PATCH] sched/cputime: make scale_stime() more precise Message-ID: <20190723093741.GB2955@redhat.com> References: <20190718131834.GA22211@redhat.com> <20190719110349.GG3419@hirez.programming.kicks-ass.net> <20190722105240.GA27219@redhat.com> <20190722200034.GJ6698@worktop.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190722200034.GJ6698@worktop.programming.kicks-ass.net> User-Agent: Mutt/1.5.20 (2009-12-10) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Tue, 23 Jul 2019 09:38:04 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 22, 2019 at 10:00:34PM +0200, Peter Zijlstra wrote: > On Mon, Jul 22, 2019 at 12:52:41PM +0200, Stanislaw Gruszka wrote: > > On Fri, Jul 19, 2019 at 01:03:49PM +0200, Peter Zijlstra wrote: > > > > shows the problem even when sum_exec_runtime is not that big: 300000 secs. > > > > > > > > The new implementation of scale_stime() does the additional div64_u64_rem() > > > > in a loop but see the comment, as long it is used by cputime_adjust() this > > > > can happen only once. > > > > > > That only shows something after long long staring :/ There's no words on > > > what the output actually means or what would've been expected. > > > > > > Also, your example is incomplete; the below is a test for scale_stime(); > > > from this we can see that the division results in too large a number, > > > but, important for our use-case in cputime_adjust(), it is a step > > > function (due to loss in precision) and for every plateau we shift > > > runtime into the wrong bucket. > > > > > > Your proposed function works; but is atrocious, esp. on 32bit. That > > > said, before we 'fixed' it, it had similar horrible divisions in, see > > > commit 55eaa7c1f511 ("sched: Avoid cputime scaling overflow"). > > > > > > Included below is also an x86_64 implementation in 2 instructions. > > > > > > I'm still trying see if there's anything saner we can do... > > > > I was always proponent of removing scaling and export raw values > > and sum_exec_runtime. But that has obvious drawback, reintroduce > > 'top hiding' issue. > > I think (but didn't grep) that we actually export sum_exec_runtime in > /proc/ *somewhere*. Yes, it is, in /proc/[PID]/schedstat for CONFIG_SCHEDSTAT=y and in /proc/[PID]/sched for CONFIG_SCHED_DEBUG=y > > But maybe we can export raw values in separate file i.e. > > /proc/[pid]/raw_cpu_times ? So applications that require more precise > > cputime values for very long-living processes can use this file. > > There are no raw cpu_times, there are system and user samples, and > samples are, by their very nature, an approximation. We just happen to > track the samples in TICK_NSEC resolution these days, but they're still > ticks (except on s390 and maybe other archs, which do time accounting in > the syscall path). > > But I think you'll find x86 people are quite opposed to doing TSC reads > in syscall entry and exit :-) By 'raw' I meant values that are stored in task_struct without processing by cputime_adjust() -> scale_stime(). Idea is that user space can make scaling using floating point, so do not have to drop precision if numbers are big. That was discussed on my RFC and PATCH series posts: https://lore.kernel.org/lkml/1364489605-5443-1-git-send-email-sgruszka@redhat.com/ https://lore.kernel.org/lkml/1365066635-2959-1-git-send-email-sgruszka@redhat.com/ There was objection that even if kernel does not tell what utime/stime values mean exactly, users already got used to scaled behaviour and changing this is 'semantic' ABI breakage. And obviously this would make 'top hiding' worm working again for unpatched top. So perhaps we can add new exports of not-scaled utime/stime and achieve the same goal without changing the meaning of existing fields. From other hand, maybe nowadays better tools exist to provide exact cputimes to user space i.e. 'perf stat' or bpf, so proposed addition is unneeded. Stanislaw