Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp2295635ybi; Thu, 18 Jul 2019 06:19:36 -0700 (PDT) X-Google-Smtp-Source: APXvYqxyN7jTb6LoZlY3tMWvSKxYXTf87v6LzoIWtoDcAnQGLLHl3o49R5HPeFB7oQV0WjGLr8dx X-Received: by 2002:a63:130f:: with SMTP id i15mr46571442pgl.158.1563455976047; Thu, 18 Jul 2019 06:19:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1563455976; cv=none; d=google.com; s=arc-20160816; b=vdlyh1d8Pt689BjUkYS5X411wfEN1B/IbXLcA8En8u74jIWK3ep3zXVi09aEkcZKWz z3av+yb0D2UqQO7Foz/GiEh52Xgax67lbZzDaB0Wskkmn71EvYvbJ3CTd9zexFXYbuga 8y868DKEXyLZXs0I5t5xTYJIRsrFY/Liyjg6GmwUYSbzHjod8kzZSWoub9M3dMgGMSff MwNd99MmJcIROfhLFx54j+dKw5eFnUKyUKd0FzegpmcWbAoIg0nu6gMSMFpHdSOy+DOk 9OP5m3vP1kkfIGJ3oT/aotj5QqIvFe1j+OH7lrk0lD9wWgBmzD07I5OwkSIqAYZyYbzy F6bg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:content-disposition :mime-version:message-id:subject:cc:to:from:date; bh=IW6cUA2d3ettyPVJ/AlX2L6eCda33SMto+ZdKMyVMz0=; b=NvX0wSYKFYsA4Plv0DTjMN11yWpv4o//SPQ0/DUUgExZLNEjm5kiqCo/gw6dFWdomx y5E5HHGAOQPPQf7iWWO+VazjrIGL+jPPYmHM9/+Euqr7q+tJPBOOaLwgSeAeZE8+VdPj ilNCKlDRdjRg582aB4Loqz/RXZiTA9OKO58uSKDY7DghrrTgd36JjnxmMLzIqyYVH1vb kMMfJ+URXpQ1CeWrwuhNnQTVL0WKRZMBBG8gOiNTSgT0EuD1RQwEGzjV17N9mUltfESI N8Kh/R3pT5owiwWvChEVguoe4+6FMXa1vqf32LuLJf1VyK+/PqMcJ5nq0hBPwGEtZJbi FY+w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i22si1217169pju.59.2019.07.18.06.19.19; Thu, 18 Jul 2019 06:19:36 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2390234AbfGRNSh (ORCPT + 99 others); Thu, 18 Jul 2019 09:18:37 -0400 Received: from mx1.redhat.com ([209.132.183.28]:46743 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726715AbfGRNSg (ORCPT ); Thu, 18 Jul 2019 09:18:36 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 3731B30842A1; Thu, 18 Jul 2019 13:18:36 +0000 (UTC) Received: from dhcp-27-174.brq.redhat.com (unknown [10.43.17.136]) by smtp.corp.redhat.com (Postfix) with SMTP id E660160635; Thu, 18 Jul 2019 13:18:34 +0000 (UTC) Received: by dhcp-27-174.brq.redhat.com (nbSMTP-1.00) for uid 1000 oleg@redhat.com; Thu, 18 Jul 2019 15:18:35 +0200 (CEST) Date: Thu, 18 Jul 2019 15:18:34 +0200 From: Oleg Nesterov To: Ingo Molnar , Peter Zijlstra , Thomas Gleixner Cc: Andrew Fox , Stephen Johnston , linux-kernel@vger.kernel.org Subject: [PATCH] sched/cputime: make scale_stime() more precise Message-ID: <20190718131834.GA22211@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.24 (2015-08-30) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.40]); Thu, 18 Jul 2019 13:18:36 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org People report that utime and stime from /proc//stat become very wrong when the numbers are big enough. In particular, the monitored application can run all the time in user-space but only stime grows. This is because scale_stime() is very inaccurate. It tries to minimize the relative error, but the absolute error can be huge. Andrew wrote the test-case: int main(int argc, char **argv) { struct task_cputime c; struct prev_cputime p; u64 st, pst, cst; u64 ut, put, cut; u64 x; int i = -1; // one step not printed if (argc != 2) { printf("usage: %s \n", argv[0]); return 1; } x = strtoull(argv[1], NULL, 0) * SEC; printf("start=%lld\n", x); p.stime = 0; p.utime = 0; while (i++ < NSTEPS) { x += STEP; c.stime = x; c.utime = x; c.sum_exec_runtime = x + x; pst = cputime_to_clock_t(p.stime); put = cputime_to_clock_t(p.utime); cputime_adjust(&c, &p, &ut, &st); cst = cputime_to_clock_t(st); cut = cputime_to_clock_t(ut); if (i) printf("ut(diff)/st(diff): %20lld (%4lld) %20lld (%4lld)\n", cut, cut - put, cst, cst - pst); } } For example, $ ./stime 300000 start=300000000000000 ut(diff)/st(diff): 299994875 ( 0) 300009124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300011124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300013124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300015124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300017124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300019124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300021124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300023124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300025124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300027124 (2000) ut(diff)/st(diff): 299994875 ( 0) 300029124 (2000) ut(diff)/st(diff): 299996875 (2000) 300029124 ( 0) ut(diff)/st(diff): 299998875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300000875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300002875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300004875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300006875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300008875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300010875 (2000) 300029124 ( 0) ut(diff)/st(diff): 300012055 (1180) 300029944 ( 820) ut(diff)/st(diff): 300012055 ( 0) 300031944 (2000) ut(diff)/st(diff): 300012055 ( 0) 300033944 (2000) ut(diff)/st(diff): 300012055 ( 0) 300035944 (2000) ut(diff)/st(diff): 300012055 ( 0) 300037944 (2000) shows the problem even when sum_exec_runtime is not that big: 300000 secs. The new implementation of scale_stime() does the additional div64_u64_rem() in a loop but see the comment, as long it is used by cputime_adjust() this can happen only once. Reported-by: Andrew Fox Signed-off-by: Oleg Nesterov --- kernel/sched/cputime.c | 66 ++++++++++++++++++++++++++++---------------------- 1 file changed, 37 insertions(+), 29 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 2305ce8..ad055a3 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -525,47 +525,55 @@ void account_idle_ticks(unsigned long ticks) } /* - * Perform (stime * rtime) / total, but avoid multiplication overflow by - * losing precision when the numbers are big. + * Perform (stime * rtime) / total, but avoid multiplication overflow + * by losing precision when the numbers are big. + * + * NOTE! currently the only user is cputime_adjust() and thus + * + * stime < total && rtime > total + * + * this means that the end result is always precise and the additional + * div64_u64_rem() inside the main loop is called at most once. */ static u64 scale_stime(u64 stime, u64 rtime, u64 total) { - u64 scaled; + u64 res = 0, div, rem; - for (;;) { - /* Make sure "rtime" is the bigger of stime/rtime */ + /* can stime * rtime overflow ? */ + while (ilog2(stime) + ilog2(rtime) > 62) { if (stime > rtime) swap(rtime, stime); - /* Make sure 'total' fits in 32 bits */ - if (total >> 32) - goto drop_precision; - - /* Does rtime (and thus stime) fit in 32 bits? */ - if (!(rtime >> 32)) - break; - - /* Can we just balance rtime/stime rather than dropping bits? */ - if (stime >> 31) - goto drop_precision; - - /* We can grow stime and shrink rtime and try to make them both fit */ - stime <<= 1; - rtime >>= 1; - continue; + if (rtime >= total) { + /* + * (rtime * stime) / total is equal to + * + * (rtime / total) * stime + + * (rtime % total) * stime / total + * + * if nothing overflows. Can the 1st multiplication + * overflow? Yes, but we do not care: this can only + * happen if the end result can't fit in u64 anyway. + * + * So the code below does + * + * res += (rtime / total) * stime; + * rtime = rtime % total; + */ + div = div64_u64_rem(rtime, total, &rem); + res += div * stime; + rtime = rem; + continue; + } -drop_precision: - /* We drop from rtime, it has more bits than stime */ + /* drop precision */ rtime >>= 1; total >>= 1; + if (!total) + return res; } - /* - * Make sure gcc understands that this is a 32x32->64 multiply, - * followed by a 64/32->64 divide. - */ - scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total); - return scaled; + return res + div64_u64(stime * rtime, total); } /* -- 2.5.0